Seaborn plots not correct [duplicate] - python

This question already has answers here:
Seaborn plots not showing up
(8 answers)
Closed 9 months ago.
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

a = np.random.random((20, 20))
mask = np.zeros_like(a)
mask[np.tril_indices_from(mask)] = True  # mask the lower triangle
with sns.axes_style("white"):  # make the plot
    ax = sns.heatmap(a, xticklabels=False, yticklabels=False, mask=mask, square=False, cmap="YlOrRd")
plt.show()
I'm making a Seaborn heatmap from the upper triangle of a numpy array.
This code uses pandas:
import pandas as pd

df = pd.read_csv('datatraining.txt', sep=r',', engine='python', header=None,
                 names=['id', 'date', 'Temperature', 'Humidity', 'Light', 'CO2', 'HumidityRatio', 'Occupancy'])
df = df.drop([0])
df.index = pd.to_datetime(df.date)
df.drop('date', axis=1, inplace=True)
df = df.apply(pd.to_numeric)

def scale(df):
    return (df - df.mean()) / df.std()

df.Temperature = scale(df.Temperature)
df.Humidity = scale(df.Humidity)
df.Light = scale(df.Light)
df.CO2 = scale(df.CO2)
df.HumidityRatio = scale(df.HumidityRatio)
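As an aside, the repeated column-by-column calls can be collapsed into one step; a minimal sketch, assuming the column names from the snippet above:

cols = ['Temperature', 'Humidity', 'Light', 'CO2', 'HumidityRatio']
df[cols] = df[cols].apply(scale)  # z-score every column with the scale() helper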

I come to this question quite regularly, and it always takes me a while to find what I'm looking for:
import seaborn as sns
import matplotlib.pyplot as plt
plt.show() # <--- This is what you are looking for
Please note: In Python 2, you can also use sns.plt.show(), but not in Python 3.
Complete Example
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""Visualize C_0.99 for all languages except the 10 with most characters."""
import seaborn as sns
import matplotlib.pyplot as plt
l = [41, 44, 46, 46, 47, 47, 48, 48, 49, 51, 52, 53, 53, 53, 53, 55, 55, 55,
55, 56, 56, 56, 56, 56, 56, 57, 57, 57, 57, 57, 57, 57, 57, 58, 58, 58,
58, 59, 59, 59, 59, 59, 59, 59, 59, 60, 60, 60, 60, 60, 60, 60, 60, 61,
61, 61, 61, 61, 61, 61, 61, 61, 61, 61, 62, 62, 62, 62, 62, 62, 62, 62,
62, 63, 63, 63, 63, 63, 63, 63, 63, 63, 64, 64, 64, 64, 64, 64, 64, 65,
65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 66, 66, 66, 66, 66, 66, 66,
67, 67, 67, 67, 67, 67, 67, 67, 68, 68, 68, 68, 68, 69, 69, 69, 70, 70,
70, 70, 71, 71, 71, 71, 71, 72, 72, 72, 72, 73, 73, 73, 73, 73, 73, 73,
74, 74, 74, 74, 74, 75, 75, 75, 76, 77, 77, 78, 78, 79, 79, 79, 79, 80,
80, 80, 80, 81, 81, 81, 81, 83, 84, 84, 85, 86, 86, 86, 86, 87, 87, 87,
87, 87, 88, 90, 90, 90, 90, 90, 90, 91, 91, 91, 91, 91, 91, 91, 91, 92,
92, 93, 93, 93, 94, 95, 95, 96, 98, 98, 99, 100, 102, 104, 105, 107, 108,
109, 110, 110, 113, 113, 115, 116, 118, 119, 121]
sns.distplot(l, kde=True, rug=False)  # note: distplot is deprecated in newer seaborn; displot/histplot are the replacements
plt.show()
Gives this result (a histogram of the values with a KDE curve overlaid).

Plots created using seaborn need to be displayed like ordinary matplotlib plots. This can be done using the
plt.show()
function from matplotlib.
Originally I posted the solution of using the matplotlib object already imported by seaborn (sns.plt.show()); however, this is considered bad practice. Therefore, simply import the matplotlib.pyplot module directly and show your plots with
import matplotlib.pyplot as plt
plt.show()
If you are using an IPython notebook, the inline backend can be invoked to remove the need to call show after each plot. The respective magic is
%matplotlib inline
Give as much detail as you can, and I can help you develop a structure. Details which will affect how you store your data include the following (a toy workflow is sketched after this list):
Size of data, # of rows, columns, types of columns; are you appending rows, or just columns?
What will typical operations look like? E.g., do a query on columns to select a bunch of rows and specific columns, then do an operation (in-memory), create new columns, save these.
(Giving a toy example could enable us to offer more specific recommendations.)
After that processing, then what do you do? Is step 2 ad hoc, or repeatable?
Input flat files: how many, and rough total size in GB. How are these organized, e.g., by records? Does each one contain different fields, or do they have some records per file with all of the fields in each file?
Do you ever select subsets of rows (records) based on criteria (e.g., select the rows with field A > 5) and then do something, or do you just select fields A, B, C with all of the records (and then do something)?
Do you 'work on' all of your columns (in groups), or is there a good proportion that you may only use for reports (e.g., you want to keep the data around, but don't need to pull those columns in explicitly until final results time)?
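To make the questions above concrete, here is a minimal toy workflow of the kind they are probing, assuming a pandas HDFStore setup (the file name, key names, and columns are hypothetical):

import pandas as pd
import numpy as np

chunk = pd.DataFrame(np.random.randn(100, 3), columns=['A', 'B', 'C'])  # stand-in for one input file
with pd.HDFStore('data.h5') as store:
    store.append('df', chunk, data_columns=['A', 'B'])  # append rows as they arrive
    subset = store.select('df', where='A > 0.5', columns=['A', 'B', 'C'])  # query on a data column
    subset['D'] = subset['A'] * subset['B']  # create a new column in-memory
    store.append('results', subset)  # save the derived results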

Related

How can I use fill_between if the points I have are arrays with a single value

I have a matplotlib script. At the end of it I have
x_val=[x[0] for x in lista]
y_val=[x[1] for x in lista]
z_val=[x[2] for x in lista]
ax.plot(x_val,y_val,'.-')
ax.plot(x_val,z_val,'.-')
This script plots fine even though the values in y_val and z_val are not strictly numbers.
Debugging I have
(Pdb) x_val
[69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153]
(Pdb) y_val
[array(1.74204588), array(1.74162786), array(1.74060187), array(1.73956786), array(-1.89492498), array(-1.89225716), array(-1.89842406), array(-1.89143466), array(-1.89171231), array(-1.88730752), array(-1.89144205), nan, array(1.71829279), array(-1.88108125), array(-1.87515878), array(-1.87912412), array(-1.87015615), array(-1.87152107), array(-1.86639765), array(-1.87383146), array(-1.86896753), array(-1.87339903), array(-1.8748417), array(-1.88515482), array(-1.88263666), array(-1.88571425), nan, nan, array(1.72480822), array(1.73666841), array(-1.88835078), array(-1.88489648), array(-1.89135095), array(-1.88647712), array(-1.88697799), array(-1.88330942), array(-1.88929744), array(-1.88320532), array(-1.88466698), array(-1.87994435), array(-1.88546968), array(-1.88014776), array(-1.87803843), array(-1.87505217), array(-1.8797416), array(-1.87223076), array(-1.87333355), array(-1.86838693), array(-1.87577428), array(-1.86875561), array(-1.86872998), array(-1.86385078), array(-1.87095955), array(-1.86509266), array(-1.86601095), array(-1.86223456), array(-1.87151403), array(-1.86695325), array(-1.86540432), array(-1.86244142), array(-1.87018407), array(-1.86767604), array(-1.8699986), array(-1.87008087), array(-1.88049869), array(1.70057683), array(1.74942263), array(-1.86556665), array(-1.88470081), array(-1.90776552), array(-1.9103818), array(-1.91022515), array(-1.89490587), array(-1.89507617), array(-1.8875979), array(-1.89318633), array(-1.8942595), array(-1.902641), array(-1.89313615), array(-1.87870174), array(-1.86319541), array(-1.85999368), array(-1.85943922), array(-1.88398592), array(1.73030903)]
z_val looks similar.
This is not a problem in itself.
However I want to do
ax.fill_between(x_val, 0, 1, where=(y_val * z_val) > 0,
                color='green', alpha=0.5)
It is a first attempt that I will probably modify (for instance, I don't yet understand what transform=ax.get_xaxis_transform() does), but the problem is that now I get an error:
File "plotgt_vs_time.py", line 160, in plot
ax.fill_between(x_val,0,1,where=(y_val*z_val) >0,
TypeError: can't multiply sequence by non-int of type 'list'
I suppose it is because it is an array. How can I modify my code so as to be able to use fill_between?
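(For reference, a minimal demonstration of where this TypeError comes from: * on plain Python lists means sequence repetition, which only accepts an integer.)

a = [1, 2, 3]
print(a * 2)       # [1, 2, 3, 1, 2, 3] -- repetition with an int works
print(a * [4, 5])  # TypeError: can't multiply sequence by non-int of type 'list'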
I tried modifying it to
x_val=[x[0] for x in lista]
y_val=[x[1][0] for x in lista]
z_val=[x[2][0] for x in lista]
but this throws an error
IndexError: too many indices for array: array is 0-dimensional, but 1 were indexed
Then I modified it to
x_val=[x[0] for x in lista]
y_val=[float(x[1]) for x in lista]
z_val=[float(x[2]) for x in lista]
And now I only get floats, so I eliminated the 0-D arrays
but I still got the error:
TypeError: can't multiply sequence by non-int of type 'list'
How can I use fill_between?
In the end I solved it by transforming the lists into numpy arrays:
import numpy as np

x_val = [x[0] for x in lista]
y_val = [float(x[1]) for x in lista]
z_val = [float(x[2]) for x in lista]
ax.fill_between(x_val, y_val, z_val, where=(np.array(y_val) * np.array(z_val)) > 0,
                color='red', alpha=0.5)
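On the transform=ax.get_xaxis_transform() question raised above: that transform makes the y-coordinates of fill_between span axes fractions (0 = bottom, 1 = top) while x stays in data units, so it shades full-height bands wherever the condition holds. A small sketch reusing the variables from the solution:

prod = np.array(y_val) * np.array(z_val)
ax.fill_between(x_val, 0, 1, where=prod > 0,
                transform=ax.get_xaxis_transform(), color='green', alpha=0.5)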

Spark UDF: Apply np.sum over a list of values in a data frame and filter values based on threshold

Very new to using Spark for data manipulation and UDFs. I have a sample df with different test scores; there are 50 different columns like these. I am trying to define a custom apply function that, for each row, counts the values greater than 80.
test_scores
[65, 92, 96, 72, 70, 85, 72, 74, 79, 10, 82]
[59, 81, 91, 69, 66, 75, 65, 61, 71, 85, 69]
Below is what I am trying:
customfunc = udf(lambda val: (np.sum(val > 30)))
df2 = (df.withColumn('scores' ,customfunc('test_scores')))
Getting the below error:
TypeError: '>' not supported between instances of 'tuple' and 'str'
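A possible fix, sketched here rather than taken from the thread: the lambda receives the whole list of scores for a row, so compare elementwise and count, and declare a return type for the UDF. This assumes test_scores is an array<int> column and uses the threshold of 80 from the question text; the name count_above is made up for the example.

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

count_above = udf(lambda vals: sum(1 for v in vals if v > 80), IntegerType())
df2 = df.withColumn('scores', count_above('test_scores'))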

How perform unsupervised clustering on numbers in an Array using PyTorch

I have this array, and I want to cluster/group the numbers into similar values.
An example of input array:
array([ 57, 58, 59, 60, 61, 78, 79, 80, 81, 82, 83, 101, 102, 103, 104, 105, 106])
expected result :
array([57, 58, 59, 60, 61]), array([78, 79, 80, 81, 82, 83]), array([101, 102, 103, 104, 105, 106])
I tried to use clustering, but I don't think it's going to work if I don't know how many clusters I'm going to split it into.
true = np.where(array>=1)
-> (array([ 57, 58, 59, 60, 61, 78, 79, 80, 81, 82, 83, 101, 102,
103, 104, 105, 106], dtype=int64),)
Dynamic binning requires explicit criteria and is not an easy problem to automate, because each array may require a different set of thresholds to bin it efficiently.
I think Gaussian mixtures with a silhouette-score criterion are your best bet. Here is code for what you are trying to achieve. The silhouette scores help you determine the number of clusters/Gaussians you should use, and they are quite accurate and interpretable for 1D data.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score
import scipy.stats
import matplotlib.pyplot as plt
%matplotlib inline

# Sample data
x = [57, 58, 59, 60, 61, 78, 79, 80, 81, 82, 83, 101, 102, 103, 104, 105, 106]

# Fit a model onto the data
data = np.array(x).reshape(-1, 1)

# Change the number of clusters to find the best silhouette score
print('Silhouette scores')
scores = []
for n in range(2, 11):
    model = GaussianMixture(n).fit(data)
    preds = model.predict(data)
    score = silhouette_score(data, preds)
    scores.append(score)
    print(n, '->', score)
n_best = np.argmax(scores) + 2  # because cluster counts start from 2
model = GaussianMixture(n_best).fit(data)  # best model fit

# Get the lists of means and standard deviations
mu = np.abs(model.means_.flatten())
sd = np.sqrt(np.abs(model.covariances_.flatten()))

# Plotting
extend_window = 50  # for zooming in or out of the graph; the higher it is, the more zoomed out
x_values = np.arange(data.min() - extend_window, data.max() + extend_window, 0.1)  # for plotting smooth curves
plt.plot(data, np.zeros(data.shape), linestyle='None', markersize=10.0, marker='o')  # plot the data on the x axis

# Plot the fitted distribution of each of the n_best components
for i in range(n_best):
    y_values = scipy.stats.norm(mu[i], sd[i])
    plt.plot(x_values, y_values.pdf(x_values))

# Split the data by cluster
pred = model.predict(data)
output = np.split(x, np.sort(np.unique(pred, return_index=True)[1])[1:])
print(output)
Silhouette scores
2 -> 0.699444729378163
3 -> 0.8962176943475543 #<--- selected as n_best
4 -> 0.7602523591781903
5 -> 0.5835620702692205
6 -> 0.5313888070615105
7 -> 0.4457049486461251
8 -> 0.4355742296918767
9 -> 0.13725490196078433
10 -> 0.2159663865546218
This creates 3 Gaussians, whose fitted distributions split the data into clusters.
The output arrays, finally split by similar values:
#output -
[array([57, 58, 59, 60, 61]),
array([78, 79, 80, 81, 82, 83]),
array([101, 102, 103, 104, 105, 106])]
You can perform a kind of differentiation on this array so that you can track changes better. Assume your array is:
A = np.array([ 57, 58, 59, 60, 61, 78, 79, 80, 81, 82, 83, 101, 102, 103, 104, 105, 106])
You can then make a difference vector by simply convolving your vector with [-1, 1]:
A_ = abs(np.convolve(A, np.array([-1, 1])))
then A_ is:
array([ 57,   1,   1,   1,   1,  17,   1,   1,   1,   1,   1,  18,   1,   1,   1,   1,   1, 106])
now you can define a threshold like 5 and find the cluster boundaries.
THRESHOLD = 5
cluster_bounds = np.argwhere(A_ > THRESHOLD)
now cluster_bounds is:
array([[ 0], [ 5], [11], [17]], dtype=int32)
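To complete the idea, a small sketch (not part of the original answer) that turns those boundary indices into the grouped arrays; the two edge artifacts at index 0 and index len(A) are dropped so that only interior cuts remain:

cuts = cluster_bounds.flatten()
cuts = cuts[(cuts > 0) & (cuts < len(A))]  # keep interior boundaries only
clusters = np.split(A, cuts)
# -> [array([57, 58, 59, 60, 61]), array([78, 79, 80, 81, 82, 83]),
#     array([101, 102, 103, 104, 105, 106])]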

Difficulty with Python scipy.optimize curve fitting: Optimal parameters not found: Number of calls to function has reached maxfev = 1000

I'm having what I hope is an easy-to-correct issue with finding the parameters of a power law. I'm getting what looks to be a common error when using curve_fit, but I haven't had success circumventing it with the suggested solutions.
The error is:
Optimal parameters not found: Number of calls to function has reached maxfev = 1000.
Below are the data and the powerlaw function I'm using. I was hoping someone might know what direction to point me in.
import matplotlib.pyplot as plt
import numpy as np
from scipy.optimize import curve_fit

def powerlaw(x, amp, ex, x0, y0):
    return amp * np.power(x - x0, ex) + y0
x = np.array([ 2.5 , 3.51778656, 4.53557312, 4.55335968,
5.57114625, 5.58893281, 5.60671937, 5.62450593,
5.64229249, 8.66007905, 8.67786561, 9.69565217,
9.71343874, 9.7312253 , 10.74901186, 10.76679842,
10.78458498, 11.80237154, 11.8201581 , 11.83794466,
11.85573123, 11.87351779, 12.89130435, 3.5 ,
3.48221344, 4.46442688, 4.44664032, 5.42885375,
5.41106719, 6.39328063, 6.37549407, 6.35770751,
6.33992095, 7.32213439, 8.30434783, 9.28656126,
11.2687747 , 11.25098814, 11.23320158, 11.21541502,
11.19762846, 11.1798419 , 12.16205534, 12.14426877,
12.12648221, 13.10869565])
y = np.array([52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68,
69, 70, 71, 72, 73, 74, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62,
63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74])
print(curve_fit(powerlaw, x, y, maxfev=1000))
Why not increase maxfev to 1e6?
print(curve_fit(powerlaw, x, y, maxfev=1000000))
gives:
(array([ 7.56848833e-80, 3.07781530e+01, -4.06201617e+02, 3.43443918e+01]),
array([[ 7.35597960e-150, -1.37675497e-071, 1.91624708e-070, 4.08767916e-073],
[-1.37675497e-071, 2.57675332e+007, -3.58647084e+008, -7.65034124e+005],
[ 1.91624707e-070, -3.58647083e+008, 4.99188464e+009, 1.06502882e+007],
[ 4.08767868e-073, -7.65034033e+005, 1.06502870e+007, 2.28543208e+004]]))
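Beyond raising maxfev, a common alternative is to give curve_fit a reasonable starting point via p0 (a sketch, not from the thread; the guesses below are illustrative, not fitted values):

p0 = [1.0, 1.0, 0.0, 50.0]  # amp, ex, x0, y0 -- hypothetical starting guesses
params, cov = curve_fit(powerlaw, x, y, p0=p0, maxfev=10000)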

python numpy pairwise edit-distance

So, I have a numpy array of strings, and I want to calculate the pairwise edit-distance between each pair of elements using this function: scipy.spatial.distance.pdist from http://docs.scipy.org/doc/scipy-0.13.0/reference/generated/scipy.spatial.distance.pdist.html
A sample of my array is as follows:
>>> d[0:10]
array(['TTTTT', 'ATTTT', 'CTTTT', 'GTTTT', 'TATTT', 'AATTT', 'CATTT',
'GATTT', 'TCTTT', 'ACTTT'],
dtype='|S5')
However, since it doesn't have an 'editdistance' option, I want to supply a customized distance function. I tried this and ran into the following error:
>>> import editdist
>>> import scipy
>>> import scipy.spatial
>>> scipy.spatial.distance.pdist(d[0:10], lambda u,v: editdist.distance(u,v))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/epd-7.3.2/lib/python2.7/site-packages/scipy/spatial/distance.py", line 1150, in pdist
[X] = _copy_arrays_if_base_present([_convert_to_double(X)])
File "/usr/local/epd-7.3.2/lib/python2.7/site-packages/scipy/spatial/distance.py", line 153, in _convert_to_double
X = np.double(X)
ValueError: could not convert string to float: TTTTT
If you really must use pdist, you first need to convert your strings to numeric format. If you know that all strings will be the same length, you can do this rather easily:
numeric_d = d.view(np.uint8).reshape((len(d),-1))
This simply views your array of strings as a long array of uint8 bytes, then reshapes it such that each original string is on a row by itself. In your example, this would look like:
In [18]: d.view(np.uint8).reshape((len(d),-1))
Out[18]:
array([[84, 84, 84, 84, 84],
[65, 84, 84, 84, 84],
[67, 84, 84, 84, 84],
[71, 84, 84, 84, 84],
[84, 65, 84, 84, 84],
[65, 65, 84, 84, 84],
[67, 65, 84, 84, 84],
[71, 65, 84, 84, 84],
[84, 67, 84, 84, 84],
[65, 67, 84, 84, 84]], dtype=uint8)
Then, you can use pdist as you normally would. Just make sure that your editdist function expects arrays of integers, rather than strings. You can quickly convert the new inputs by calling .tostring() (renamed .tobytes() in modern NumPy):
def editdist(x, y):
    s1 = x.tostring()
    s2 = y.tostring()
    # ... rest of function as before ...
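For completeness, a simple Levenshtein implementation (a sketch standing in for editdist.distance above, which comes from a third-party module); it works on both strings and the byte strings produced by .tostring():

def levenshtein(s1, s2):
    # Classic dynamic-programming edit distance, O(len(s1) * len(s2)).
    if len(s1) < len(s2):
        s1, s2 = s2, s1
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1):
        current = [i + 1]
        for j, c2 in enumerate(s2):
            current.append(min(previous[j + 1] + 1,        # deletion
                               current[j] + 1,             # insertion
                               previous[j] + (c1 != c2)))  # substitution
        previous = current
    return previous[-1]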
Alternatively, you can skip pdist entirely and compute the condensed distance vector yourself:

import numpy as np

def my_pdist(data, f):
    # Returns the condensed distance vector, in the same pair order pdist uses.
    N = len(data)
    matrix = np.empty(N * (N - 1) // 2)  # integer division keeps the size an int in Python 3
    ind = 0
    for i in range(N):
        for j in range(i + 1, N):
            matrix[ind] = f(data[i], data[j])
            ind += 1
    return matrix
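Usage could then look like this (a sketch; levenshtein is the helper defined above, and scipy's squareform expands the condensed vector into a full square matrix):

from scipy.spatial.distance import squareform

condensed = my_pdist(d, levenshtein)
square = squareform(condensed)  # full N x N pairwise edit-distance matrix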
