python numpy pairwise edit-distance

python numpy pairwise edit-distance - python

So, I have a numpy array of strings, and I want to calculate the pairwise edit-distance between each pair of elements using this function: scipy.spatial.distance.pdist from http://docs.scipy.org/doc/scipy-0.13.0/reference/generated/scipy.spatial.distance.pdist.html
A sample of my array is as follows:
>>> d[0:10]
array(['TTTTT', 'ATTTT', 'CTTTT', 'GTTTT', 'TATTT', 'AATTT', 'CATTT',
'GATTT', 'TCTTT', 'ACTTT'],
dtype='|S5')
However, since it doesn't have the 'editdistance' option, therefore, I want to give a customized distance function. I tried this and I faced the following error:
>>> import editdist
>>> import scipy
>>> import scipy.spatial
>>> scipy.spatial.distance.pdist(d[0:10], lambda u,v: editdist.distance(u,v))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/epd-7.3.2/lib/python2.7/site-packages/scipy/spatial/distance.py", line 1150, in pdist
[X] = _copy_arrays_if_base_present([_convert_to_double(X)])
File "/usr/local/epd-7.3.2/lib/python2.7/site-packages/scipy/spatial/distance.py", line 153, in _convert_to_double
X = np.double(X)
ValueError: could not convert string to float: TTTTT

If you really must use pdist, you first need to convert your strings to numeric format. If you know that all strings will be the same length, you can do this rather easily:
numeric_d = d.view(np.uint8).reshape((len(d),-1))
This simply views your array of strings as a long array of uint8 bytes, then reshapes it such that each original string is on a row by itself. In your example, this would look like:
In [18]: d.view(np.uint8).reshape((len(d),-1))
Out[18]:
array([[84, 84, 84, 84, 84],
[65, 84, 84, 84, 84],
[67, 84, 84, 84, 84],
[71, 84, 84, 84, 84],
[84, 65, 84, 84, 84],
[65, 65, 84, 84, 84],
[67, 65, 84, 84, 84],
[71, 65, 84, 84, 84],
[84, 67, 84, 84, 84],
[65, 67, 84, 84, 84]], dtype=uint8)
Then, you can use pdist as you normally would. Just make sure that your editdist function is expecting arrays of integers, rather than strings. You could quickly convert your new inputs by calling .tostring():
def editdist(x, y):
s1 = x.tostring()
s2 = y.tostring()
... rest of function as before ...

def my_pdist(data,f):
N=len(data)
matrix=np.empty([N*(N-1)/2])
ind=0
for i in range(N):
for j in range(i+1,N):
matrix[ind]=f(data[i],data[j])
ind+=1
return matrix

Related

How can I use fill_between if the points I have are array with single value

I have a matplotlib script.
In it in the end I have
x_val=[x[0] for x in lista]
y_val=[x[1] for x in lista]
z_val=[x[2] for x in lista]
ax.plot(x_val,y_val,'.-')
ax.plot(x_val,z_val,'.-')
This script plots well eventhough the values in y_val and z_val are not strictly numbers
Debugging I have
(Pdb) x_val
[69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153]
(Pdb) y_val
[array(1.74204588), array(1.74162786), array(1.74060187), array(1.73956786), array(-1.89492498), array(-1.89225716), array(-1.89842406), array(-1.89143466), array(-1.89171231), array(-1.88730752), array(-1.89144205), nan, array(1.71829279), array(-1.88108125), array(-1.87515878), array(-1.87912412), array(-1.87015615), array(-1.87152107), array(-1.86639765), array(-1.87383146), array(-1.86896753), array(-1.87339903), array(-1.8748417), array(-1.88515482), array(-1.88263666), array(-1.88571425), nan, nan, array(1.72480822), array(1.73666841), array(-1.88835078), array(-1.88489648), array(-1.89135095), array(-1.88647712), array(-1.88697799), array(-1.88330942), array(-1.88929744), array(-1.88320532), array(-1.88466698), array(-1.87994435), array(-1.88546968), array(-1.88014776), array(-1.87803843), array(-1.87505217), array(-1.8797416), array(-1.87223076), array(-1.87333355), array(-1.86838693), array(-1.87577428), array(-1.86875561), array(-1.86872998), array(-1.86385078), array(-1.87095955), array(-1.86509266), array(-1.86601095), array(-1.86223456), array(-1.87151403), array(-1.86695325), array(-1.86540432), array(-1.86244142), array(-1.87018407), array(-1.86767604), array(-1.8699986), array(-1.87008087), array(-1.88049869), array(1.70057683), array(1.74942263), array(-1.86556665), array(-1.88470081), array(-1.90776552), array(-1.9103818), array(-1.91022515), array(-1.89490587), array(-1.89507617), array(-1.8875979), array(-1.89318633), array(-1.8942595), array(-1.902641), array(-1.89313615), array(-1.87870174), array(-1.86319541), array(-1.85999368), array(-1.85943922), array(-1.88398592), array(1.73030903)]
z_val similarly
This does not represent a problem
However I want to do
ax.fill_between(x_val,0,1,where=(y_val*z_val) >0,
color='green',alpha=0.5 )
It is a first attempt that I will probably modify (in this example for instance I don't understand yet what transform=ax.get_xaxis_transform() does) but the problem is that now I got an error
File "plotgt_vs_time.py", line 160, in plot
ax.fill_between(x_val,0,1,where=(y_val*z_val) >0,
TypeError: can't multiply sequence by non-int of type 'list'
I suppose it is because it is an array. How can I modify my code so as to be able to use fill_between?
I tried modifying it to
x_val=[x[0] for x in lista]
y_val=[x[1][0] for x in lista]
z_val=[x[2][0] for x in lista]
but this throws an error
IndexError: too many indices for array: array is 0-dimensional, but 1 were indexed
Then I modified it to
x_val=[x[0] for x in lista]
y_val=[float(x[1]) for x in lista]
z_val=[float(x[2]) for x in lista]
And now I only get floats, so I eliminated the 0-D arrays
but still got the error
TypeError: can't multiply sequence by non-int of type 'list'
How can I use fill_beetween?

In the end I solve it transforming the lists into numpy arrays
x_val=[x[0] for x in lista]
y_val=[float(x[1]) for x in lista]
z_val=[float(x[2]) for x in lista]
ax.fill_between(x_val,y_val,z_val,where=(np.array(y_val)*np.array(z_val)) >0,
color='red',alpha=0.5 )

Seaborn plots not correct [duplicate]

This question already has answers here:
Seaborn plots not showing up
(8 answers)
Closed 9 months ago.
import seaborn as sns, numpy as np
a = np.random.random((20, 20))
mask = np.zeros_like(a)
mask[np.tril_indices_from(mask)] = True #mask the lower triangle
with snenter code heres.axes_style("white"): #make the plot
ax = sns.heatmap(a, xticklabels=False, yticklabels=False, mask=mask, square=False, cmap="YlOrRd")
plt.show()
I make a Seaborn heatmap from an upper triangle numpy array.
This code using pandas:
import pandas as pd
df = pd.read_csv('datatraining.txt', sep=r',', engine='python', header=None, names = ['id', 'date','Temperature','Humidity','Light','CO2','HumidityRatio','Occupancy'])
df = df.drop([0])
df.index = pd.to_datetime(df.date)
df.drop('date', axis=1, inplace=True)
df = df.apply(pd.to_numeric)
def scale(df):
return (df - df.mean()) / df.std()
df.Temperature = scale(df.Temperature)
df.Humidity = scale(df.Humidity)
df.Light = scale(df.Light)
df.CO2 = scale(df.CO2)
df.HumidityRatio = scale(df.HumidityRatio)

I come to this question quite regularly and it always takes me a while to find what I search:
import seaborn as sns
import matplotlib.pyplot as plt
plt.show() # <--- This is what you are looking for
Please note: In Python 2, you can also use sns.plt.show(), but not in Python 3.
Complete Example
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""Visualize C_0.99 for all languages except the 10 with most characters."""
import seaborn as sns
import matplotlib.pyplot as plt
l = [41, 44, 46, 46, 47, 47, 48, 48, 49, 51, 52, 53, 53, 53, 53, 55, 55, 55,
55, 56, 56, 56, 56, 56, 56, 57, 57, 57, 57, 57, 57, 57, 57, 58, 58, 58,
58, 59, 59, 59, 59, 59, 59, 59, 59, 60, 60, 60, 60, 60, 60, 60, 60, 61,
61, 61, 61, 61, 61, 61, 61, 61, 61, 61, 62, 62, 62, 62, 62, 62, 62, 62,
62, 63, 63, 63, 63, 63, 63, 63, 63, 63, 64, 64, 64, 64, 64, 64, 64, 65,
65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 66, 66, 66, 66, 66, 66, 66,
67, 67, 67, 67, 67, 67, 67, 67, 68, 68, 68, 68, 68, 69, 69, 69, 70, 70,
70, 70, 71, 71, 71, 71, 71, 72, 72, 72, 72, 73, 73, 73, 73, 73, 73, 73,
74, 74, 74, 74, 74, 75, 75, 75, 76, 77, 77, 78, 78, 79, 79, 79, 79, 80,
80, 80, 80, 81, 81, 81, 81, 83, 84, 84, 85, 86, 86, 86, 86, 87, 87, 87,
87, 87, 88, 90, 90, 90, 90, 90, 90, 91, 91, 91, 91, 91, 91, 91, 91, 92,
92, 93, 93, 93, 94, 95, 95, 96, 98, 98, 99, 100, 102, 104, 105, 107, 108,
109, 110, 110, 113, 113, 115, 116, 118, 119, 121]
sns.distplot(l, kde=True, rug=False)
plt.show()
Gives
this result

Plots created using seaborn need to be displayed like ordinary matplotlib plots. This can be done using the
plt.show()
function from matplotlib.
Originally I posted the solution to use the already imported matplotlib object from seaborn (sns.plt.show()) however this is considered to be a bad practice. Therefore, simply directly import the matplotlib.pyplot module and show your plots with
import matplotlib.pyplot as plt
plt.show()
If the IPython notebook is used the inline backend can be invoked to remove the necessity of calling show after each plot. The respective magic is
%matplotlib inline
Details which will affect how you store your data, like:
Give as much detail as you can; and I can help you develop a structure.
Size of data, # of rows, columns, types of columns; are you appending rows, or just columns?
What will typical operations look like. E.g. do a query on columns to select a bunch of rows and specific columns, then do an operation (in-memory), create new columns, save these.
(Giving a toy example could enable us to offer more specific recommendations.)
After that processing, then what do you do? Is step 2 ad hoc, or repeatable?
Input flat files: how many, rough total size in Gb. How are these organized e.g. by records? Does each one contains different fields, or do they have some records per file with all of the fields in each file?
Do you ever select subsets of rows (records) based on criteria (e.g. select the rows with field A > 5)? and then do something, or do you just select fields A, B, C with all of the records (and then do something)?
Do you 'work on' all of your columns (in groups), or are there a good proportion that you may only use for reports (e.g. you want to keep the data around, but don't need to pull in that column explicity until final results time)?

How perform unsupervised clustering on numbers in an Array using PyTorch

I got this array and I want to cluster/group the numbers into similar values.
An example of input array:
array([ 57, 58, 59, 60, 61, 78, 79, 80, 81, 82, 83, 101, 102, 103, 104, 105, 106]
expected result :
array([57,58,59,60,61]), ([78,79,80,81,82,83]), ([101,102,103,104,105,106])
I tried to use clustering but I don't think it's gonna work if I don't know how many I'm going to split up.
true = np.where(array>=1)
-> (array([ 57, 58, 59, 60, 61, 78, 79, 80, 81, 82, 83, 101, 102,
103, 104, 105, 106], dtype=int64),)

Dynamic binning requires explicit criteria and is not an easy problem to automate because each array may require a different set of thresholds to bin them efficiently.
I think Gaussian mixtures with a silhouette score criteria is the best bet you have. Here is a code for what you are trying to achieve. The silhouette scores help you determine the number of clusters/Gaussians you should use and is quite accurate and interpretable for 1D data.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score
import scipy
import matplotlib.pyplot as plt
%matplotlib inline
#Sample data
x = [57, 58, 59, 60, 61, 78, 79, 80, 81, 82, 83, 101, 102, 103, 104, 105, 106]
#Fit a model onto the data
data = np.array(x).reshape(-1,1)
#change value of clusters to check best silhoutte score
print('Silhoutte scores')
scores = []
for n in range(2,11):
model = GaussianMixture(n).fit(data)
preds = model.predict(data)
score = silhouette_score(data, preds)
scores.append(score)
print(n,'->',score)
n_best = np.argmax(scores)+2 #because clusters start from 2
model = GaussianMixture(n_best).fit(data) #best model fit
#Get list of means and variances
mu = np.abs(model.means_.flatten())
sd = np.sqrt(np.abs(model.covariances_.flatten()))
#Plotting
extend_window = 50 #this is for zooming into or out of the graph, higher it is , more zoom out
x_values = np.arange(data.min()-extend_window, data.max()+extend_window, 0.1) #For plotting smooth graphs
plt.plot(data, np.zeros(data.shape), linestyle='None', markersize = 10.0, marker='o') #plot the data on x axis
#plot the different distributions (in this case 2 of them)
for i in range(num_components):
y_values = scipy.stats.norm(mu[i], sd[i])
plt.plot(x_values, y_values.pdf(x_values))
#split data by clusters
pred = model.predict(data)
output = np.split(x, np.sort(np.unique(pred, return_index=True)[1])[1:])
print(output)
Silhoutte scores
2 -> 0.699444729378163
3 -> 0.8962176943475543 #<--- selected as nbest
4 -> 0.7602523591781903
5 -> 0.5835620702692205
6 -> 0.5313888070615105
7 -> 0.4457049486461251
8 -> 0.4355742296918767
9 -> 0.13725490196078433
10 -> 0.2159663865546218
This creates 3 gaussians with the following distributions to split the data into clusters.
Arrays output finally split by similar values
#output -
[array([57, 58, 59, 60, 61]),
array([78, 79, 80, 81, 82, 83]),
array([101, 102, 103, 104, 105, 106])]

You can perform kind of derivation on this array so that you can track changes better, assume your array is:
A = np.array([ 57, 58, 59, 60, 61, 78, 79, 80, 81, 82, 83, 101, 102, 103, 104, 105, 106])
so you can make a derivation vector by simply convolving your vector with [-1 1]:
A_ = abs(np.convolve(A, np.array([-1, 1])))
then A_ is:
array([57, 1, 1, 1, 1, 17, 1, 1, 1, 1, 1, 18, 2, 1, 1, 1, 106]
now you can define a threshold like 5 and find the cluster boundaries.
THRESHOLD = 5
cluster_bounds = np.argwhere(A_ > THRESHOLD)
now cluster_bounds is:
array([[0], [5], [11], [16]], dtype=int32)

Difficulty with Python scipy.optimize curve fitting: Optimal parameters not found: Number of calls to function has reached maxfev = 1000

I'm having what I hope is an easy to correct issue with finding the parameters to a power law. I'm getting what looks to be a common error when using curve_fit, but haven't had success circumventing it with suggested solutions.
The error is:
Optimal parameters not found: Number of calls to function has reached maxfev = 1000.
Below is the data and powerlaw function I'm using. I was hoping someone might know what direction to point me.
import matplotlib.pyplot as plt
import numpy as np
from scipy.optimize import curve_fit
def powerlaw(x, amp, ex, x0, y0):
return (amp * np.power((x-x0),ex) + y0)
x = np.array([ 2.5 , 3.51778656, 4.53557312, 4.55335968,
5.57114625, 5.58893281, 5.60671937, 5.62450593,
5.64229249, 8.66007905, 8.67786561, 9.69565217,
9.71343874, 9.7312253 , 10.74901186, 10.76679842,
10.78458498, 11.80237154, 11.8201581 , 11.83794466,
11.85573123, 11.87351779, 12.89130435, 3.5 ,
3.48221344, 4.46442688, 4.44664032, 5.42885375,
5.41106719, 6.39328063, 6.37549407, 6.35770751,
6.33992095, 7.32213439, 8.30434783, 9.28656126,
11.2687747 , 11.25098814, 11.23320158, 11.21541502,
11.19762846, 11.1798419 , 12.16205534, 12.14426877,
12.12648221, 13.10869565])
y = np.array([52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68,
69, 70, 71, 72, 73, 74, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62,
63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74])
print(curve_fit(powerlaw, x, y, maxfev=1000))

Why not increasing maxfev to 1e6?
print(curve_fit(powerlaw, x, y, maxfev=1000000))
gives:
(array([ 7.56848833e-80, 3.07781530e+01, -4.06201617e+02, 3.43443918e+01]),
array([[ 7.35597960e-150, -1.37675497e-071, 1.91624708e-070, 4.08767916e-073],
[-1.37675497e-071, 2.57675332e+007, -3.58647084e+008, -7.65034124e+005],
[ 1.91624707e-070, -3.58647083e+008, 4.99188464e+009, 1.06502882e+007],
[ 4.08767868e-073, -7.65034033e+005, 1.06502870e+007, 2.28543208e+004]]))

How do you apply a function incorporating random numbers to rows of a numpy array in python?

So I have a 3D array with shape (28, 28, 60000), corresponding to 60000 28x28 images. I want to get random 24x24 chunks of each image by using the following function:
def crop(X):
x = random.randint(0,3)
y = random.randint(0,3)
return X[x:24+x, y:24+y,]
If I apply the function crop(X) to my matrix X, however, the same chunk from each sample is returned. How do I ensure each sample uses different randomly generated x and y values?

Here is my attempt at it.
Basically the idea is you will have to somehow split the matrix away from the last dimension (numpy doesn't let you apply over things which aren't a 1d array). You can do this using dsplit, and put it back together using dstack.
Then you would apply your crop function over each component. As a simplified example:
import random
a = np.array(range(300)).reshape(10,10,3)
def crop(X):
x = random.randint(0,3)
y = random.randint(0,3)
return X[x:3+x, y:3+y]
# we can loop over each component of the matrix by first splitting it
# off the last dimension:
b = [np.squeeze(x) for x in np.dsplit(a, a.shape[-1])]
# this will recreate the original matrix
c = np.dstack(b)
# so putting it together with the crop function
get_rand_matrix = [crop(np.squeeze(x)) for x in np.dsplit(a, a.shape[-1])]
desired_result = np.dstack(get_rand_matrix)

Here's a vectorized generic ( to handle non-squarish arrays as well) approach using NumPy broadcasting and linear indexing that generates the slices across all the images in one-go to produce a 3D array output, like so -
# Store shape
m,n,N = A.shape # A is the input array
# Set output block shape
out_blk_shape = (24,24)
x = np.random.randint(0,m-out_blk_shape[0]-1,(N))
y = np.random.randint(0,n-out_blk_shape[1]-1,(N))
# Get range arrays for the block across all images
R0 = np.arange(out_blk_shape[0])
R1 = np.arange(out_blk_shape[1])
# Get offset and thus all linear indices. Finally index into input array.
offset_idx = (y*n*N + x*N) + np.arange(N)
all_idx = R0[:,None]*n*N + R1*N + offset_idx[:,None,None]
out = A.ravel()[all_idx]
Sample run -
1) Inputs :
In [188]: A = np.random.randint(0,255,(6,7,2)) # Input array
In [189]: # Set output block shape
...: out_blk_shape = (3,2) # For demo reduced to a small shape
# Rest of the code stays the same.
In [190]: x # To select the start columns from the slice
Out[190]: array([1, 0])
In [191]: y # To select the start rows from the slice
Out[191]: array([1, 2])
In [192]: A[:,:,0]
Out[192]:
array([[ 75, 160, 110, 29, 77, 198, 78],
[237, 39, 219, 184, 73, 149, 144],
[138, 148, 243, 160, 165, 125, 17],
[155, 157, 110, 175, 91, 216, 61],
[101, 5, 209, 98, 212, 44, 63],
[213, 155, 96, 160, 193, 185, 157]])
In [193]: A[:,:,1]
Out[193]:
array([[201, 223, 7, 140, 98, 41, 167],
[139, 247, 134, 17, 74, 216, 0],
[ 44, 28, 26, 182, 45, 24, 34],
[178, 29, 233, 146, 157, 230, 173],
[111, 220, 234, 6, 246, 218, 149],
[200, 101, 23, 116, 166, 199, 233]])
2) Output :
In [194]: out
Out[194]:
array([[[ 39, 219],
[148, 243],
[157, 110]],
[[ 44, 28],
[178, 29],
[111, 220]]])

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

python numpy pairwise edit-distance - python

def my_pdist(data,f): N=len(data) matrix=np.empty([N*(N-1)/2]) ind=0 for i in range(N): for j in range(i+1,N): matrix[ind]=f(data[i],data[j]) ind+=1 return matrix

Related

How can I use fill_between if the points I have are array with single value

Seaborn plots not correct [duplicate]

How perform unsupervised clustering on numbers in an Array using PyTorch

Difficulty with Python scipy.optimize curve fitting: Optimal parameters not found: Number of calls to function has reached maxfev = 1000

How do you apply a function incorporating random numbers to rows of a numpy array in python?

Categories

Resources