I have a dataset like the following, where the first and second columns indicate a connection from one node to another:
fromNode toNode
0 1
0 2
0 31
0 73
1 3
1 56
2 10
...
I want to generate the Laplacian matrix from this dataset. I use the following code to do so, but it complains because the dataset itself is not a square matrix. Is there a function that accepts this type of dataset and generates the matrix?
from numpy import genfromtxt
from scipy.sparse import csgraph
import csv
G = genfromtxt('./data.csv', delimiter='\t').astype(int)
dataset = csgraph.laplacian(G, normed=False)
Rather than finding a function that will accept your data, process your data into the correct format.
Fake data f simulates a file object; use io.StringIO (not io.BytesIO) in Python 3.
data = '''0 1
0 2
0 31
0 73
1 3
1 56
2 10'''
import io
f = io.StringIO(data)
Read each line of the data and process it into a list of edges of the form (node1, node2).
edges = []
for line in f:
    line = line.strip()
    (node1, node2) = map(int, line.split())
    edges.append((node1, node2))
Find the highest node number, create a square numpy ndarray based on the highest node number. You need to be aware of your node numbering - is it zero based?
import numpy as np

N = max(x for edge in edges for x in edge)
G = np.zeros((N+1,N+1), dtype = np.int64)
Iterate over the edges and assign each edge weight to the graph:
for row, column in edges:
    G[row,column] = 1
Here is a solution making use of numpy integer array indexing.
f.seek(0)  # rewind f, since the loop above already consumed it
z = np.genfromtxt(f, dtype = np.int64)
n = z.max() + 1
g = np.zeros((n,n), dtype = np.int64)
rows, columns = z.T
g[rows, columns] = 1
Of course both of those assume all edge weights are equal.
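If your data had a third weight column (purely an assumption; your sample shows only two columns), the same integer array indexing would carry the weights through, for example:
# hypothetical: each line is "fromNode toNode weight"
z = np.genfromtxt(f, dtype = np.int64)   # shape (E, 3)
n = z[:, :2].max() + 1
g = np.zeros((n, n), dtype = np.int64)
g[z[:, 0], z[:, 1]] = z[:, 2]            # assign each edge weight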
See Graph Representations in the scipy docs. I couldn't test this graph to confirm it is valid; I'm getting an import error for csgraph, so I probably need to update scipy.
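Putting the pieces together, here is a minimal end-to-end sketch (assuming scipy's csgraph imports correctly on your setup, and that the graph is undirected so the adjacency matrix should be symmetric):
import io
import numpy as np
from scipy.sparse import csgraph

data = '''0 1
0 2
1 3'''

f = io.StringIO(data)
z = np.genfromtxt(f, dtype = np.int64)
n = z.max() + 1
g = np.zeros((n, n), dtype = np.int64)
g[z[:, 0], z[:, 1]] = 1
g = g + g.T                      # symmetrize for an undirected graph
laplacian = csgraph.laplacian(g, normed=False)
print(laplacian)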
I am trying to create a range of signals of different frequencies, and I am finding it difficult to store amplitude vs time in another storage matrix for each frequency ranging from 0 to 50 Hz. For example, for a frequency of 20 Hz I want to store the amplitude vs time for that frequency, then for 21 Hz I want to store the amplitude vs time for that frequency, and so on, until I have all of them in a large matrix. I am getting confused with the indexing and syntax at this point, any help welcome!
import numpy as np
max_freq = 50
s_frequency = np.arange(0,51,0.1)
fs = 200
time = np.arange(0,5-(1/fs),(1/fs))
x = np.empty((len(time)), dtype=np.float32)
i = 0
j = 0
full_array = np.empty((len(s_frequency),len(time),len(time)), dtype=np.float32)
amplitude = np.zeros(999)
for f1 in s_frequency:
    i = 0
    for t in time:
        amplitude[i] = np.sin(2*np.pi*f1*t)
        i = i + 1
    full_array[i] = ([time], [amplitude])
I have also tried the following:
import numpy as np
max_freq = 50
s_frequency = np.arange(0,50.1,0.1)
fs = 200
time = np.arange(0,5-(1-fs),(1/fs))
#full_array = np.sin(2*np.pi*np.outer(s_frequency,time))
full_array = np.empty((len(s_frequency),len(time), len(time)), dtype=np.float32)
for f1 in s_frequency:
    array = []
    for i, t in enumerate(time):
        amplitude = np.sin(2*np.pi*f1*t)
        array.insert(i,amplitude)
    full_array[i] = [time, array]
Not 100% sure what you're trying to do, but it seems like you're trying to initialize a 2-dimensional grid (i.e. a matrix) where you have a dimension for time and one for frequency. Here is what I would do:
import numpy as np
max_freq = 50
s_frequency = np.arange(0,51,0.1)
fs = 200
time = np.arange(0,5-(1/fs),(1/fs))
full_array = np.sin(2*np.pi*np.outer(s_frequency,time))
No explicit for-loops or index handling needed. np.outer() will give you a 2D grid (i.e. a matrix) of frequency versus time. Now what's left is to compute the sine of 2 Pi times each grid value. Very conveniently, numpy functions accept arrays as input, so we can simply call np.sin(2*np.pi*np.outer(s_frequency,time)).
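As a usage example, to pull out the amplitude-vs-time trace for (approximately) 20 Hz you can select the matching row (variable names as in the snippet above):
idx = np.argmin(np.abs(s_frequency - 20))   # row index closest to 20 Hz
trace_20hz = full_array[idx]                # amplitude at every point in time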
Not sure what x and j are good for in your code, or why full_array should be 3-dimensional. Would you like to include a spatial component as well?
By the way, a construct like this:
i = 0
for t in time:
    amplitude[i] = np.sin(2*np.pi*f1*t)
    i = i + 1
can easily be avoided in Python, thanks to Python's built-in enumerate() function. It would then look like this:
for i, t in enumerate(time):
    amplitude[i] = np.sin(2*np.pi*f1*t)
which does essentially the same, but you don't have to explicitly create the index i = 0 and manually increment it in every iteration with i = i + 1.
I have an algorithm I want to implement and I'm trying to figure out the best way to do it.
I have a matrix H of size m×n (m is the number of last inputs, i.e. a sliding window; n is the number of attributes).
I have a set of attributes A, and I want to find correlations between the attributes.
My problem is: how can I tag a matrix column/row with a name?
this is the algorithm I'm trying to implement:
Attributes a_i, a_j are extracted from H and denoted H_i^T, H_j^T (where T denotes transpose). We then apply the Pearson correlation to them, denoted ρ_{i,j}.
for example:
If we have:
H (m×n = 4×3):

IQ   Height   Weight
30   180      80
30   170      60
40   183      85
10   190      95
ct = 0.7
A = {IQ, Height, Weight}
Then the result we should get is:
CS = {(C,0)}
Where C = {Height, Weight}
I would also love to get any visualization tool recommendations.
Thanks for your help!
Pandas is your best friend when it comes to tabular data. I'm not an expert on linear algebra notation, but it seems like what you're trying to do is append a tuple to a set if the two items in the tuple are correlated by more than the threshold value, i.e. if Height and Weight have a correlation coefficient > 0.7, then add those two values to the list CS. I would do something like this:
import pandas as pd
import seaborn as sns
df = pd.DataFrame.from_dict({
    "IQ": [30, 30, 40, 10],
    "Height": [180, 170, 183, 190],
    "Weight": [80, 60, 85, 95]
})
lst = []
threshold = 0.7
p_arr = df.corr().to_dict()
for attr in p_arr:
    for sub_attr in p_arr[attr]:
        p = p_arr[attr][sub_attr]
        if attr != sub_attr and p > threshold:
            lst.append(((attr, sub_attr), p))
produces:
[(('Height', 'Weight'), 0.9956654266839726),
(('Weight', 'Height'), 0.9956654266839726)]
and for correlation heatmap
sns.heatmap(df.corr())
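If you would rather have each pair appear only once (the output above contains both orderings), one option is to iterate over unique column pairs with itertools.combinations; a sketch using the same df and threshold:
from itertools import combinations

corr = df.corr()
cs = [((a, b), corr.loc[a, b])
      for a, b in combinations(df.columns, 2)
      if corr.loc[a, b] > threshold]
# [(('Height', 'Weight'), 0.9956654266839726)]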
How do I create a vector called row_min that contains the minimum value for each of the 25 rows (this implies the shape of this vector will be (25,)), and a vector called col_max that contains the maximum value for each of the 8 columns (col_max will be a vector of shape (8,))?
I have developed the code below; I'm new to the vector concept and need some suggestions.
import random
import numpy
c = numpy.random.rand(25,8)
print("Random float array 25X8 between range of 0.0 to 1.0 \n")
print(c,"\n")
I couldn't find a source to understand the concept.
You have to specify the axis that np.max(..., axis=...) should work on:
import numpy as np

c = np.random.rand(5,3) # smaller for less output
print(c,"\n")
print( np.max(c, axis=0)) # max of each column
print( np.max(c, axis=1)) # max of each row
Output:
[[0.47894278 0.80356294 0.34453725]
[0.33802491 0.82795648 0.28438504]
[0.46838701 0.73664987 0.82215448]
[0.66245476 0.59981989 0.43837083]
[0.28515865 0.86093323 0.92248524]]
# axis 0 (columns)
[0.66245476 0.86093323 0.92248524]
# axis 1 (rows)
[0.80356294 0.82795648 0.82215448 0.66245476 0.92248524]
See matrix.max(); min() works the same way.
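Applied to your 25x8 array, that would look like the following (the row-wise minimum uses axis=1 because you reduce along the columns, while the column-wise maximum uses axis=0):
import numpy as np

c = np.random.rand(25, 8)
row_min = np.min(c, axis=1)           # minimum of each row, shape (25,)
col_max = np.max(c, axis=0)           # maximum of each column, shape (8,)
print(row_min.shape, col_max.shape)   # (25,) (8,)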
I have a very large dataset that I am running a clustering model on. The clustering outputs a numpy array formatted as such:
[ 0 1 2 1 1 0 0 0 1 2 1 0 2 0 1 2 1 0 2 2 0 0 1 ... ]
I want to take the original dataset, and create three datasets based on the array. How would I go about this?
Initial Dataset Work:
import pandas as pd
pd.options.mode.chained_assignment = None
raw_data = pd.read_csv("LendingClub2012to2013.csv", low_memory = False, skiprows=[0])
# Some cleaning done, target leakage removed, dummies created, imputation, etc.
clean_data = raw_data.drop(text2d + leakage2d + noinfo2d + irr2d, axis = 1)
I assume that your dataset is a numpy array. Try creating masks to select the elements you want from the original data set. Some verbose code:
# Your original data set (2d numpy array)
orig_data = ...
# The cluster assignments output by the algorithm (1d numpy array)
cluster_assignments = ...
clusters = []
for cluster_id in range(3):  # use xrange in Python 2
    mask = (cluster_assignments == cluster_id)
    clusters.append(orig_data[mask])
A more concise version:
clusters = [orig_data[cluster_assignments == cluster_id] for cluster_id in range(3)]
If your dataset is a pandas DataFrame rather than a numpy array, simply replace orig_data[...] with orig_data.loc[...].
The output of this code is a list clusters in which each element is a dataset with the data for just one of the clusters.
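Since your initial dataset work uses pandas, here is the same idea as a sketch against the DataFrame; cluster_labels is a hypothetical name for the 1d array your clustering model returned, assumed to line up row-for-row with clean_data:
import numpy as np

# cluster_labels: 1d array of cluster ids, one per row of clean_data (assumption)
clusters = [clean_data.loc[cluster_labels == k]
            for k in np.unique(cluster_labels)]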
I have data which looks like this (example):
x y d
0 0 -2
1 0 0
0 1 1
1 1 3
And I want to turn this into a colormap plot which looks like one of these:
where x and y are in the table and the color is given by 'd'. However, I want a predetermined color for each number, for example:
-2 - orange
0 - blue
1 - red
3 - yellow
Not necessarily these colours, but I need to assign a colour to each number, and the numbers are not in order or sequence; they are just a set of five or six random numbers which repeat themselves across the entire array.
Any ideas? I haven't got any code for this as I don't know where to start. I have however looked at examples on here, such as:
Matplotlib python change single color in colormap
However, they only show how to define colours, not how to link those colours to a specific value.
It turns out this is harder than I thought, so maybe someone has an easier way of doing this.
Since we need to create an image of the data, we will store it in a 2D array. We can then map the data values to the integers 0 .. (number of different data values - 1) and assign a color to each of them. The reason is that we want the final colormap to be equally spaced. So:
value -2 --> integer 0 --> color orange
value 0 --> integer 1 --> color blue
and so on.
Having nicely spaced integers, we can use a ListedColormap on the image of newly created integer values.
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.colors
# define the image as a 2D array
d = np.array([[-2,0],[1,3]])
# create a sorted list of all unique values from d
ticks = np.unique(d.flatten()).tolist()
# create a new array of same shape as d
# we will later use this to store values from 0 to number of unique values
dc = np.zeros(d.shape)
# fill the array dc
for i in range(d.shape[0]):
    for j in range(d.shape[1]):
        dc[i,j] = ticks.index(d[i,j])
# now we need n (= number of unique values) different colors
colors= ["orange", "blue", "red", "yellow"]
# and put them to a listed colormap
colormap = matplotlib.colors.ListedColormap(colors)
plt.figure(figsize=(5,3))
#plot the newly created array, shift the colorlimits,
# such that later the ticks are in the middle
im = plt.imshow(dc, cmap=colormap, interpolation="none", vmin=-0.5, vmax=len(colors)-0.5)
# create a colorbar with n different ticks
cbar = plt.colorbar(im, ticks=range(len(colors)) )
#set the ticklabels to the unique values from d
cbar.ax.set_yticklabels(ticks)
#set nice tickmarks on image
plt.gca().set_xticks(range(d.shape[1]))
plt.gca().set_yticks(range(d.shape[0]))
plt.show()
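As an aside, the nested loops that build dc can be replaced by a vectorized lookup; because ticks is sorted, np.searchsorted returns exactly the index of each value (a sketch equivalent to the loop above):
dc = np.searchsorted(ticks, d)   # maps every value in d to its index in ticks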
As it may not be intuitively clear how to get the array d into the shape needed for plotting with imshow, i.e. as a 2D array, here are two ways of converting the input data columns:
import numpy as np
x = np.array([0,1,0,1])
y = np.array([ 0,0,1,1])
d_original = np.array([-2,0,1,3])
#### Method 1 ####
# Intuitive method.
# Assumption:
# * Indexing in x and y starts at 0
# * every index pair occurs exactly once.
# Create an empty array of shape (n+1,m+1)
# where n is the maximum index in y and
# m is the maximum index in x
d = np.zeros((y.max()+1 , x.max()+1), dtype=int)
for k in range(len(d_original)):
    d[y[k],x[k]] = d_original[k]
print(d)
#### Method 2 ####
# Fast method
# Additional assumption:
# indices in x and y are ordered exactly such
# that y is sorted ascendingly first,
# and for each index in y, x is sorted.
# In this case the original d array can simply be reshaped
d2 = d_original.reshape((y.max()+1 , x.max()+1))
print(d2)