I have a list of coordinates with areas mapped out as follows:
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

df=pd.DataFrame({'user_id':[55,55,356,356,356,356,632,752,938,963,963,1226,2663,2663,2663,2663,2663,3183,3197,3344,3387,3387,3387,3387,3396,3515,3536,3570,3819,3883,3883,3883,3883,3883,3883,3883,3883,3883,3883,3883,3883,4584,4594,4713,4931,4931,5026,5487,5487,5575,5575,5575,5602,5639,5639,5639,5639,5783,5783,5783,5783,5783,5801,6373,6718,6886,6886,7055,7055,7608,7608,7777,8186,8186,8307,8712,9271,9896,9991,9991,9991,],
'latitude':[13.2633943,13.2633964,12.809677124,12.8099212646,12.8100585938,12.810065981,12.9440132,12.2958104,12.5265661,13.0767648,13.0853577,12.6301221,12.8558120728,12.8558349609,12.8558654785,12.8558807373,12.8558959961,12.9141417,13.0696411133,13.0708333,10.7904833,10.7904833,10.7904833,12.884091,13.0694428,13.204637,12.6922086,13.0767648,13.3489958,12.8653798,12.8654014,12.8654124,12.8654448,12.8654521,12.8654658,12.8654733,12.8654815,12.8654844,12.8655367,12.8655376,12.865576,12.4025539,13.1986348,12.9548317,11.664325,11.6690603,13.0656551,13.1137554,13.1137978,12.770418,12.9141417,12.9141417,15.3530727,12.8285405054,12.8285925,12.8288406,12.829668,12.2958104,12.5583190918,12.7367172241,12.7399597168,12.7422103882,12.8631981,13.3378762,12.5638375681,13.1961683,13.1993678,12.1210997,12.5265661,13.1332778931,13.13331604,12.1210997,13.0649372,13.0658797,12.6955714,12.8213806152,13.0641708374,13.2013835,13.1154662,13.1957473755,13.2329025269,],
'longitude':[75.4341412,75.4341377,77.6955155017,77.6952344177,77.6952628334,77.6952629697,75.7926285,76.6393805,78.2149575,77.6397007,77.6445166,77.1145378,77.7985897361,77.7985953164,77.798622112,77.7985610742,77.7986275271,74.8559568,77.6520116309,77.6519444,78.7046725,78.7046725,78.7046725,74.8372421,77.6523596,77.6506622,78.6181131,77.6397007,74.7855559,77.7972191,77.7971733,77.7971429,77.7971621,77.7970823,77.7970327,77.7970371,77.7972272,77.7970335,77.7969649,77.796956,77.7971244,75.9811564,77.7065928,77.4739615,78.1460142,78.139311,77.4380296,77.5732437,77.573201,74.8609332,74.8559568,74.8559568,75.1386825,77.6891233027,77.6899376,77.6892531,77.6902955,76.6393805,77.7842363745,77.7841222429,77.7837989946,77.7830295359,77.4336428,77.117325,75.5833357573,77.7053231,77.7095658,78.1582143,78.2149575,77.5728687166,77.5729374436,78.1582143,77.7435873,77.7444939,78.0620963,77.6606095672,77.746332751,77.7082838,77.6069977,77.7055573788,77.6956690934,],
})
I am using DBSCAN to cluster these latitude/longitude pairs:
X = np.array(df[['latitude', 'longitude']])
kms_per_radian = 6371.0088
epsilon = 1 / kms_per_radian  # eps of 1 km, converted to radians
# haversine distance on radian coordinates matches the km-based epsilon
db = DBSCAN(eps=epsilon, min_samples=5, algorithm='ball_tree', metric='haversine')
model = db.fit(np.radians(X))
cluster_labels = db.labels_
num_clusters = len(set(cluster_labels))  # note: this counts the noise label -1 as a cluster
cluster_labels = cluster_labels.astype(float)
cluster_labels[cluster_labels == -1] = np.nan
clusters = pd.Series([X[cluster_labels == n] for n in range(num_clusters)])
labels = pd.DataFrame(db.labels_, columns=['CLUSTER_LABEL'])
dfnew = pd.concat([df, labels], axis=1, sort=False)
How do I get the center point of these clusters and map it back to the dataset, so that when I display the data in folium a marker (and its summary) starts at each cluster's center?
So far I have tried
from shapely.geometry import MultiPoint
from geopy.distance import great_circle

def get_centermost_point(cluster):
    centroid = (MultiPoint(cluster).centroid.x, MultiPoint(cluster).centroid.y)
    centermost_point = min(cluster, key=lambda point: great_circle(point, centroid).m)
    return tuple(centermost_point)

centermost_points = clusters.map(get_centermost_point)
which gives me an IndexError: list index out of range.
To get the coordinates of each cluster's centroid:
for ea in clusters:
    print(MultiPoint(ea).centroid)
Outcome:
POINT (12.85585784912 77.79859915316)
POINT (12.86547048333333 77.79709629166666)
POINT (13.1982603551 77.70706457576)
POINT EMPTY
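That POINT EMPTY entry is what breaks get_centermost_point above: num_clusters counts the noise label -1, so the last entry of clusters contains no points. A minimal workaround (a sketch, under that assumption) is to filter empty clusters out before mapping:
# keep only clusters that actually contain points
non_empty = clusters[clusters.map(len) > 0]
centermost_points = non_empty.map(get_centermost_point)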
To create a geodataframe from the centroids and plot it (note the points here were built in (lat, long) order, so .x is latitude):
import geopandas as gpd

# Create a geodataframe of the centroids
clusters_centroids = [MultiPoint(ea).centroid for ea in clusters]
cgdf = gpd.GeoDataFrame(clusters, crs='EPSG:4326', geometry=clusters_centroids)

# Eliminate the empty row(s)
good_cgdf = cgdf[~cgdf['geometry'].is_empty]

# Plot to see the centroids
good_cgdf.plot()
The output plot:
To add the center points back into the original dataframe df.
Here I start by checking dfnew, which is simply df with the added column CLUSTER_LABEL.
print(dfnew)
user_id latitude longitude CLUSTER_LABEL
0 55 13.263394 75.434141 -1
1 55 13.263396 75.434138 -1
2 356 12.809677 77.695516 -1
3 356 12.809921 77.695234 -1
4 356 12.810059 77.695263 -1
.. ... ... ... ...
76 9271 13.064171 77.746333 -1
77 9896 13.201384 77.708284 2
78 9991 13.115466 77.606998 -1
79 9991 13.195747 77.705557 2
80 9991 13.232903 77.695669 -1
[81 rows x 4 columns]
The column CLUSTER_LABEL will be used to join and pull values from the cgdf dataframe.
Add a CLUSTER_LABEL column to cgdf with the matching cluster labels (this assumes the rows of cgdf appear in label order, with the empty noise row last):
cgdf["CLUSTER_LABEL"] = [0, 1, 2, -1]
Drop column 0 of cgdf (it still holds the raw coordinate arrays):
cgdf.drop(columns=[0], inplace=True)
Check current cgdf
print(cgdf)
geometry CLUSTER_LABEL
0 POINT (12.856 77.799) 0
1 POINT (12.865 77.797) 1
2 POINT (13.198 77.707) 2
3 POINT EMPTY -1
Merge two dataframes into new dataframe dfnew2.
dfnew2 = dfnew.merge(cgdf, on='CLUSTER_LABEL')
Check the current status of dfnew2; it should look like this:
user_id latitude longitude CLUSTER_LABEL geometry
0 55 13.263394 75.434141 -1 POINT EMPTY
1 55 13.263396 75.434138 -1 POINT EMPTY
2 356 12.809677 77.695516 -1 POINT EMPTY
3 356 12.809921 77.695234 -1 POINT EMPTY
4 356 12.810059 77.695263 -1 POINT EMPTY
.. ... ... ... ... ...
76 4594 13.198635 77.706593 2 POINT (13.198 77.707)
77 6886 13.196168 77.705323 2 POINT (13.198 77.707)
78 6886 13.199368 77.709566 2 POINT (13.198 77.707)
79 9896 13.201384 77.708284 2 POINT (13.198 77.707)
80 9991 13.195747 77.705557 2 POINT (13.198 77.707)
[81 rows x 5 columns]
'dfnew2' should be equivalent to the original dataframe, with 2 additional special columns: 'CLUSTER_LABEL' and 'geometry' (the cluster's center point).
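To finally show the centers in folium (a minimal sketch, assuming folium is installed; the popup text is illustrative):
import folium

# Center the map on the mean coordinate of all points
m = folium.Map(location=[df['latitude'].mean(), df['longitude'].mean()], zoom_start=10)

# One marker per cluster center; noise rows (label -1) have an empty geometry, so skip them
centers = dfnew2[dfnew2['CLUSTER_LABEL'] != -1].drop_duplicates('CLUSTER_LABEL')
for _, row in centers.iterrows():
    # the centroids were built from (latitude, longitude) pairs,
    # so .x is the latitude and .y is the longitude here
    folium.Marker(location=[row['geometry'].x, row['geometry'].y],
                  popup='Cluster {}'.format(row['CLUSTER_LABEL'])).add_to(m)
m.save('cluster_centers.html')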
import pandas as pd
from sklearn.cluster import KMeans
def kmeans_centers(list_of_lats_lngs):  # input: a list of [lat, lng] lists
    try:
        # one row per coordinate pair
        data = pd.DataFrame(list_of_lats_lngs, columns=['lat', 'lng'])
        data.dropna(axis=0, how='any', subset=['lat', 'lng'], inplace=True)
        kmeans = KMeans(n_clusters=3, init='k-means++')
        data['cluster_label'] = kmeans.fit_predict(data[['lat', 'lng']])
        centers = kmeans.cluster_centers_  # coordinates of the cluster centers
        return centers
    except Exception as e:
        print("kmeans - clustering exception", e)
        return None
Ready to use. Example input (a list of [lat, lng] pairs):
[[12.02,12.34],[12.12,12.04],[12.092,12.74],[22.02,13.34]]
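Usage would look like this (a sketch; with n_clusters=3 hard-coded, the input needs at least 3 points):
centers = kmeans_centers([[12.02, 12.34], [12.12, 12.04], [12.092, 12.74], [22.02, 13.34]])
print(centers)  # ndarray of shape (3, 2): one (lat, lng) row per cluster center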
Pandas has the very handy function pd.DataFrame.corr() to do pairwise correlation of columns.
That means it is possible to get the correlation between every pair of columns in a dataframe. For instance:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(100, 10)))
0 1 2 3 4 5 6 7 8 9
0 9 17 55 32 7 97 61 47 48 46
1 8 83 87 56 17 96 81 8 87 0
2 60 29 8 68 56 63 81 5 24 52
3 42 76 6 75 7 59 19 17 3 63
...
Now it is possible to test correlation between all 10 columns with df.corr(method='pearson'):
0 1 2 3 4 5 6 7 8 9
0 1.000000 0.082789 -0.094096 -0.086091 0.163091 0.013210 0.167204 -0.002514 0.097481 0.091020
1 0.082789 1.000000 0.027158 -0.080073 0.056364 -0.050978 -0.018428 -0.014099 -0.135125 -0.043797
2 -0.094096 0.027158 1.000000 -0.102975 0.101597 -0.036270 0.202929 0.085181 0.093723 -0.055824
3 -0.086091 -0.080073 -0.102975 1.000000 -0.149465 0.033130 -0.020929 0.183301 -0.003853 -0.062889
4 0.163091 0.056364 0.101597 -0.149465 1.000000 -0.007567 -0.017212 -0.086300 0.177247 -0.008612
5 0.013210 -0.050978 -0.036270 0.033130 -0.007567 1.000000 -0.080148 -0.080915 -0.004612 0.243713
6 0.167204 -0.018428 0.202929 -0.020929 -0.017212 -0.080148 1.000000 0.135348 0.070330 0.008170
7 -0.002514 -0.014099 0.085181 0.183301 -0.086300 -0.080915 0.135348 1.000000 -0.114413 -0.111642
8 0.097481 -0.135125 0.093723 -0.003853 0.177247 -0.004612 0.070330 -0.114413 1.000000 -0.153564
9 0.091020 -0.043797 -0.055824 -0.062889 -0.008612 0.243713 0.008170 -0.111642 -0.153564 1.000000
Is there a simple way to also get the corresponding p-values (ideally in pandas), as it is returned e.g. by scipy's kendalltau()?
Why not use the "method" argument of pandas.DataFrame.corr()? It accepts:
pearson : standard correlation coefficient.
kendall : Kendall Tau correlation coefficient.
spearman : Spearman rank correlation.
callable: callable with input two 1d ndarrays and returning a float.
from scipy.stats import kendalltau, pearsonr, spearmanr

def kendall_pval(x, y):
    return kendalltau(x, y)[1]

def pearsonr_pval(x, y):
    return pearsonr(x, y)[1]

def spearmanr_pval(x, y):
    return spearmanr(x, y)[1]
and then
pvalues = df.corr(method=pearsonr_pval)
Note that corr() forces the diagonal of the returned matrix to 1 regardless of the callable, so the diagonal entries are not real p-values.
Probably just loop. It's basically what pandas does in the source code to generate the correlation matrix anyway:
import pandas as pd
import numpy as np
from scipy import stats

df_corr = pd.DataFrame()  # Correlation matrix
df_p = pd.DataFrame()     # Matrix of p-values
for x in df.columns:
    for y in df.columns:
        corr = stats.pearsonr(df[x], df[y])
        df_corr.loc[x, y] = corr[0]
        df_p.loc[x, y] = corr[1]
If you want to leverage the fact that the matrix is symmetric, so that you only need to calculate roughly half of the entries, then do:
mat = df.values.T
K = len(df.columns)
correl = np.empty((K, K), dtype=float)
p_vals = np.empty((K, K), dtype=float)

for i, ac in enumerate(mat):
    for j, bc in enumerate(mat):
        if i > j:
            continue
        corr = stats.pearsonr(ac, bc)
        # corr = stats.kendalltau(ac, bc)
        correl[i, j] = corr[0]
        correl[j, i] = corr[0]
        p_vals[i, j] = corr[1]
        p_vals[j, i] = corr[1]

df_p = pd.DataFrame(p_vals)
df_corr = pd.DataFrame(correl)
#pd.concat([df_corr, df_p], keys=['corr', 'p_val'])
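The commented-out line hints at keeping both matrices in one object, for example:
combined = pd.concat([df_corr, df_p], keys=['corr', 'p_val'])
print(combined.loc['corr'])   # the correlation coefficients
print(combined.loc['p_val'])  # the matching p-values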
This will work (rewritten here so that pearsonr is applied to each pair of distinct columns; the original applied it to a column and itself, which always returns r=1, p=0):
from itertools import combinations
from scipy.stats import pearsonr

# r and p for every pair of distinct columns
rows = [(a, b, *pearsonr(df[a], df[b])) for a, b in combinations(df.columns, 2)]
df_result = pd.DataFrame(rows, columns=['col_x', 'col_y', 'Correlation_coefficient', 'P-value'])
Does this work for you?
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import pearsonr

# Call the correlation function; you could round the values if needed
df_c = df.corr().round(1)

# Get the p-values (subtract the identity so the diagonal is 0)
pval = df.corr(method=lambda x, y: pearsonr(x, y)[1]) - np.eye(*df_c.shape)

# Turn the p-values into stars: *** for <= 0.001, ** for <= 0.01, * for <= 0.05
p = pval.applymap(lambda x: ''.join(['*' for t in [0.001, 0.01, 0.05] if x <= t]))

# df_c2 below gives you the dataframe with correlation coefficients and p-value stars
df_c2 = df_c.astype(str) + p

# You could also plot the correlation matrix using sns.heatmap if you want
# Mask the upper triangle
matrix = np.triu(df_c)
# Convert to array for the heatmap annotations
df_c3 = df_c2.to_numpy()
# Plot the heatmap
plt.figure(figsize=(13, 8))
sns.heatmap(df_c, annot=df_c3, fmt='', vmin=-1, vmax=1, center=0, cmap='coolwarm', mask=matrix)
plt.show()
I'm trying to interpolate temperature data observed over an urban area made up of 5 municipalities. I am using cartopy to interpolate and draw the map; however, when I run the script the temperature interpolation is not shown, and I only get the urban-area layer with the color bar. Can someone help me fix this error? The shapefile components are here:
https://www.dropbox.com/s/0u76k3yegvr09sx/LimiteAMG.shp?dl=0
https://www.dropbox.com/s/yxsmm3v2ey3ngsp/LimiteAMG.cpg?dl=0
https://www.dropbox.com/s/yx05n31dfkggbb6/LimiteAMG.dbf?dl=0
https://www.dropbox.com/s/a6nk0xczgjeen2d/LimiteAMG.prj?dl=0
https://www.dropbox.com/s/royw7s51n2f0a6x/LimiteAMG.qpj?dl=0
https://www.dropbox.com/s/7k44dcl1k5891qc/LimiteAMG.shx?dl=0
Data
Lat Lon tmax
0 20.8208 -103.4434 22.8
1 20.7019 -103.4728 17.7
2 20.6833 -103.3500 24.9
3 20.6280 -103.4261 NaN
4 20.7205 -103.3172 26.4
5 20.7355 -103.3782 25.7
6 20.6593 -103.4136 NaN
7 20.6740 -103.3842 25.8
8 20.7585 -103.3904 NaN
9 20.6230 -103.4265 NaN
10 20.6209 -103.5004 NaN
11 20.6758 -103.6439 24.5
12 20.7084 -103.3901 24.0
13 20.6353 -103.3994 23.0
14 20.5994 -103.4133 25.0
15 20.6302 -103.3422 NaN
16 20.7400 -103.3122 23.0
17 20.6061 -103.3475 NaN
18 20.6400 -103.2900 23.0
19 20.7248 -103.5305 24.0
20 20.6238 -103.2401 NaN
21 20.4753 -103.4451 NaN
Code:
import cartopy.crs as ccrs
import cartopy.io.shapereader as shpreader
from matplotlib.colors import BoundaryNorm
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from metpy.gridding.gridding_functions import interpolate, remove_nan_observations
to_proj = ccrs.PlateCarree()
data=pd.read_csv('/home/borisvladimir/Documentos/Datos/EMAs/EstacionesZMG/RedZMG.csv',usecols=(1,2,3),names=['Lat','Lon','tmax'],na_values=-99999,header=0)
fname='/home/borisvladimir/Dropbox/Diversos/Shapes/LimiteAMG.shp'
adm1_shapes = list(shpreader.Reader(fname).geometries())
lon = data['Lon'].values
lat = data['Lat'].values
xp, yp, _ = to_proj.transform_points(ccrs.Geodetic(), lon, lat).T
x_masked, y_masked, t = remove_nan_observations(xp, yp, data['tmax'].values)
# Interpolate temperature using Cressman
tempx, tempy, temp = interpolate(x_masked, y_masked, t, interp_type='cressman', minimum_neighbors=3, search_radius=400000, hres=35000)
temp = np.ma.masked_where(np.isnan(temp), temp)
levels = list(range(-20, 20, 1))
cmap = plt.get_cmap('viridis')
norm = BoundaryNorm(levels, ncolors=cmap.N, clip=True)
fig = plt.figure(figsize=(15, 10))
view = fig.add_subplot(1, 1, 1, projection=to_proj)
view.add_geometries(adm1_shapes, ccrs.PlateCarree(),edgecolor='black', facecolor='white', alpha=0.5)
view.set_extent([-103.8, -103, 20.3, 21.099 ], ccrs.PlateCarree())
ZapLon,ZapLat=-103.50,20.80
GuadLon,GuadLat=-103.33,20.68
TonaLon,TonaLat=-103.21,20.62
TlaqLon,TlaqLat=-103.34,20.59
TlajoLon,TlajoLat=-103.44,20.47
plt.text(ZapLon,ZapLat,'Zapopan',transform=ccrs.Geodetic())
plt.text(GuadLon,GuadLat,'Guadalajara',transform=ccrs.Geodetic())
plt.text(TonaLon,TonaLat,'Tonala',transform=ccrs.Geodetic())
plt.text(TlaqLon,TlaqLat,'Tlaquepaque',transform=ccrs.Geodetic())
plt.text(TlajoLon,TlajoLat,'Tlajomulco',transform=ccrs.Geodetic())
mmb = view.pcolormesh(tempx, tempy, temp,transform=ccrs.PlateCarree(),cmap=cmap, norm=norm)
plt.colorbar(mmb, shrink=.4, pad=0.02, boundaries=levels)
plt.show()
The problem is in the call to MetPy's interpolate function. With the setting of hres=35000, it is generating a grid spaced at 35 km. However, your data points are spaced much more closely than that; together, that results in a generated grid with only two points, shown as the red points below (black points are the original stations with non-masked data):
Both of those grid points fall outside the bounds of your data points, so they end up masked. If instead we set hres to something much lower, say 5 km (i.e. 5000), then a much more sensible result comes out:
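For reference, only the interpolate call changes (the rest of the script stays as-is):
tempx, tempy, temp = interpolate(x_masked, y_masked, t, interp_type='cressman',
                                 minimum_neighbors=3, search_radius=400000, hres=5000)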
I have the following dataframe that I wish to perform some regression on. I am using Seaborn but can't quite seem to find a non-linear function that fits. Below is my code and its output, and below that is the dataframe I am using, df. Note I have truncated the axis in this plot.
I would like to fit either a Poisson- or Gaussian-style function.
import pandas
import seaborn

graph = seaborn.lmplot('R', 'Equilibrium Value', data=df, fit_reg=True, order=2, ci=None)
graph.set(xlim=(-0.25, 10))
However this produces the following figure.
df
R Equilibrium Value
0 5.102041 7.849315e-03
1 4.081633 2.593005e-02
2 0.000000 9.990000e-01
3 30.612245 4.197446e-14
4 14.285714 6.730133e-07
5 12.244898 5.268202e-06
6 15.306122 2.403316e-07
7 39.795918 3.292955e-18
8 19.387755 3.875505e-09
9 45.918367 5.731842e-21
10 1.020408 9.936863e-01
11 50.000000 8.102142e-23
12 2.040816 7.647420e-01
13 48.979592 2.353931e-22
14 43.877551 4.787156e-20
15 34.693878 6.357120e-16
16 27.551020 9.610208e-13
17 29.591837 1.193193e-13
18 31.632653 1.474959e-14
19 3.061224 1.200807e-01
20 23.469388 6.153965e-11
21 33.673469 1.815181e-15
22 42.857143 1.381050e-19
23 25.510204 7.706746e-12
24 13.265306 1.883431e-06
25 9.183673 1.154141e-04
26 41.836735 3.979575e-19
27 36.734694 7.770915e-17
28 18.367347 1.089037e-08
29 44.897959 1.657448e-20
30 16.326531 8.575577e-08
31 28.571429 3.388120e-13
32 40.816327 1.145412e-18
33 11.224490 1.473268e-05
34 24.489796 2.178927e-11
35 21.428571 4.893541e-10
36 32.653061 5.177167e-15
37 8.163265 3.241799e-04
38 22.448980 1.736254e-10
39 46.938776 1.979881e-21
40 47.959184 6.830820e-22
41 26.530612 2.722925e-12
42 38.775510 9.456077e-18
43 6.122449 2.632851e-03
44 37.755102 2.712309e-17
45 10.204082 4.121137e-05
46 35.714286 2.223883e-16
47 20.408163 1.377819e-09
48 17.346939 3.057373e-08
49 7.142857 9.167507e-04
EDIT
Attached are two graphs produced from this and another data set when increasing the order parameter beyond 20.
Order = 3
I have trouble understanding why an lmplot is needed here. Usually you want to perform a fit by taking a model function and fitting it to the data.
Assume you want a Gaussian function:
model = lambda x, A, x0, sigma, offset: offset+A*np.exp(-((x-x0)/sigma)**2)
you can fit it to your data with scipy.optimize.curve_fit:
popt, pcov = curve_fit(model, df["R"].values,
                       df["Equilibrium Value"].values, p0=[1, 0, 2, 0])
Complete code:
import pandas as pd
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt

df = ... # your dataframe

# plot data
plt.scatter(df["R"].values, df["Equilibrium Value"].values, label="data")

# Fitting: Gaussian with free amplitude, center, width and offset
model = lambda x, A, x0, sigma, offset: offset + A*np.exp(-((x-x0)/sigma)**2)
popt, pcov = curve_fit(model, df["R"].values,
                       df["Equilibrium Value"].values, p0=[1, 0, 2, 0])

# plot fit
x = np.linspace(df["R"].values.min(), df["R"].values.max(), 250)
plt.plot(x, model(x, *popt), label="fit")

# Fitting: the same Gaussian with only sigma free (A=1, x0=0, offset=0)
model2 = lambda x, sigma: model(x, 1, 0, sigma, 0)
popt2, pcov2 = curve_fit(model2, df["R"].values,
                         df["Equilibrium Value"].values, p0=[2])

# plot fit2
x2 = np.linspace(df["R"].values.min(), df["R"].values.max(), 250)
plt.plot(x2, model2(x2, *popt2), label="fit2")

plt.xlim(None, 10)
plt.legend()
plt.show()