I have a func_df with 4 functions:
x y1 y2 y3 y4
0 -20.0 -0.839071 10.0 0.816164 -8795.000
1 -19.9 -0.865213 9.9 0.994372 -8667.619
2 -19.8 -0.889191 9.8 1.162644 -8541.472
3 -19.7 -0.910947 9.7 1.319299 -8416.553
4 -19.6 -0.930426 9.6 1.462772 -8292.856
.. ... ... ... ... ...
395 19.5 -0.947580 9.5 1.591630 6659.375
396 19.6 -0.930426 9.6 1.462772 6766.216
397 19.7 -0.910947 9.7 1.319299 6874.193
398 19.8 -0.889191 9.8 1.162644 6983.312
399 19.9 -0.865213 9.9 0.994372 7093.579
And a test_df with scatter points:
x y
0 -6.2 0.360801
1 6.4 -3.655422
2 -17.6 -6065.659700
3 -1.5 -3.247304
4 -17.7 -0.785430
.. ... ...
95 1.6 3.722551
96 16.3 -1.067487
97 -13.3 1.857445
98 -3.8 -0.008831
99 -13.2 1.294064
I want to find the deviation (distance) between all the scatter points and the 4 functions when the x-value is the same in both data frames.
There are some scatter points with the same x-value and different y-values.
Edit: A quick example:
Starting with column y1 from func_df:
The 1st value is x = -20.0, y1 = -0.839071.
I want the program to check whether there is a row with x = -20.0 in test_df and, if so, find the difference between the y-value of that row and the y-value of func_df, which is -0.839071.
Imagine that in test_df there is a row with x = -20, y = -1. Then what I want is abs(-1 - (-0.839071)).
I used abs() because the distance has to be a positive value.
This was for row 0 of column y1. I need it for all rows and also for y2, y3 and y4 of func_df.
I tried something like this:
if test_df.x.equals(func.x):
    result_df = func.iloc[:, 1:5].apply(lambda cell: cell - test_df.y[cell.index])
But honestly it was a shot in the dark; I have no idea what I'm doing.
Create sample data:
import pandas as pd

func_df = pd.DataFrame(data={'x': [-20.9, -20.8, -20.7, -20.6],
                             'y1': [-0.12, -0.021, -0.04, -0.91],
                             'y2': [10.0, 9.9, 9.8, 9.7],
                             'y3': [0.99437, 1.162644, 1.319299, 1.462772],
                             'y4': [-8667.619, -8541.472, -8416.553, -8292.856]})
print(func_df)
'''
x y1 y2 y3 y4
0 -20.9 -0.120 10.0 0.994370 -8667.619
1 -20.8 -0.021 9.9 1.162644 -8541.472
2 -20.7 -0.040 9.8 1.319299 -8416.553
3 -20.6 -0.910 9.7 1.462772 -8292.856
'''
test_df=pd.DataFrame(data={'x':[-20.9,-15.2],'y':[0.360801,-3.655422]})
print(test_df)
'''
x y
0 -20.9 0.360801
1 -15.2 -3.655422
'''
As you can see, the first rows match in both dataframes. Now let's combine the two; the following merge keeps only the matching rows.
final=func_df.merge(test_df,on='x')
print(final)
'''
x y1 y2 y3 y4 y
0 -20.9 -0.12 10.0 0.99437 -8667.619 0.360801
'''
# if you want to keep all rows of func_df, use final=func_df.merge(test_df, how='left', on='x')
Now that we have the matching rows, with the y value from test_df next to y1-y4 from func_df, we can do the calculation.
loop_cols = [*final.columns[1:5]]  # ['y1', 'y2', 'y3', 'y4']
for i in loop_cols:
    final['distance_{}'.format(i)] = abs(final['y'] - final[i])
print(final)
'''
x y1 y2 y3 y4 y distance_y1 distance_y2 distance_y3 distance_y4
0 -20.9 -0.12 10.0 0.99437 -8667.619 0.360801 0.480801 9.639199 0.633569 8667.979801000001
'''
If the values of 'x' in test_df are unique, you could merge the two dataframes on 'x':
import pandas

merged_df = pandas.merge(test_df, func_df, on='x')
abs_delta_y1 = (merged_df['y'] - merged_df['y1']).abs()
etc...
I agree with Ray Pelletier - unfortunately I do not have enough reputation to comment.
If you merge the two data frames on x, you can then create a new dataframe in which, for each row of the merged dataframe, you calculate the difference between y and y1, y and y2, and so on.
If you read up on the merge function, there is a how parameter you can set to "inner"; the merged dataframe will then only contain values of x that are present in both dataframes.
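A minimal sketch of that idea (assuming func_df and test_df are the frames from the question; the distance_* column names are just illustrative):
import pandas as pd

# inner merge keeps only the x values present in both frames
merged = pd.merge(func_df, test_df, on='x', how='inner')

# absolute deviation between the scatter y and each function column
for col in ['y1', 'y2', 'y3', 'y4']:
    merged['distance_' + col] = (merged['y'] - merged[col]).abs()
print(merged)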
I am trying to calculate the sum of sales for stores in the same neighborhood based on their geographic coordinates. I have sample data:
import pandas as pd

data = pd.DataFrame({'ID': ['1', '2', '3', '4'], 'SALE': [100, 120, 110, 95],
                     'X': [23, 22, 21, 24], 'Y': [44, 45, 41, 46],
                     'X_MIN': [22, 21, 20, 23], 'Y_MIN': [43, 44, 40, 45],
                     'X_MAX': [24, 23, 22, 25], 'Y_MAX': [45, 46, 42, 47]})
ID  SALE   X   Y  X_MIN  Y_MIN  X_MAX  Y_MAX
 1   100  23  44     22     43     24     45
 2   120  22  45     21     44     23     46
 3   110  21  41     20     40     22     42
 4    95  24  46     23     45     25     47
X and Y are the coordinates of the store; the MIN and MAX values define the area it covers. For each row, I want to sum the sales of all stores whose coordinates fall within that store's boundaries. I expect results like the table below: the SUM for ID 1 is 220 because the coordinates (X, Y) of stores 1 and 2 fall within store 1's MIN/MAX limits, while for ID 4 only that store itself lies within its boundaries, so its sum of sales is 95.
final={'ID':['1','2','3','4'],'SUM':[220,220,110,95]}
ID  SUM
 1  220
 2  220
 3  110
 4   95
What I've tried:
data['SUM'] = data.apply(lambda x: data['SALE'].sum(data[(data['X'] >= x['X_MIN'])&(data['X'] <= x['X_MAX'])&(data['Y'] >= x['Y_MIN'])&(data['Y'] <= x['Y_MAX'])]),axis=1)
Unfortunately the code does not work and I am getting the following error:
TypeError: unhashable type: 'DataFrame'
I am asking for help in solving this problem.
If you put the summation at the end, your solution works:
data['SUM'] = data.apply(
    lambda x: (data['SALE'][(data['X'] >= x['X_MIN']) & (data['X'] <= x['X_MAX']) &
                            (data['Y'] >= x['Y_MIN']) & (data['Y'] <= x['Y_MAX'])]).sum(),
    axis=1)
###output of data['SUM']:
###0 220
###1 220
###2 110
###3 95
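A possible vectorized alternative (not taken from the answer above), sketched with NumPy broadcasting and assuming data is the DataFrame from the question:
import numpy as np

x = data['X'].to_numpy()
y = data['Y'].to_numpy()

# inside[i, j] is True if store j's coordinates fall inside store i's box
inside = ((x >= data['X_MIN'].to_numpy()[:, None]) & (x <= data['X_MAX'].to_numpy()[:, None]) &
          (y >= data['Y_MIN'].to_numpy()[:, None]) & (y <= data['Y_MAX'].to_numpy()[:, None]))

# the matrix product with SALE sums the sales of the stores inside each box
data['SUM'] = inside @ data['SALE'].to_numpy()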
I have an xarray dataset of sea surface temperature values on an x/y grid. x and y are 1D vector coordinates, so it looks like this minimal example:
<xarray.Dataset>
Dimensions: (x: 10, y: 10)
Coordinates:
* x (x) int64 0 1 2 3 4 5 6 7 8 9
* y (y) int64 0 1 2 3 4 5 6 7 8 9
Data variables:
data (x, y) float64 0.559 0.01037 0.1562 ... 0.08778 0.3272 0.8661
I am able to compute the lat/lon from this x/y grid, and the output is two 2D arrays. I can add them as coordinates with ds.assign_coords:
<xarray.Dataset>
Dimensions: (x: 10, y: 10)
Coordinates:
* x (x) int64 0 1 2 3 4 5 6 7 8 9
* y (y) int64 0 1 2 3 4 5 6 7 8 9
lat (x, y) float64 30.0 30.0 30.0 30.0 30.0 ... 39.0 39.0 39.0 39.0
lon (x, y) float64 -120.0 -119.0 -118.0 -117.0 ... -113.0 -112.0 -111.0
Data variables:
data (x, y) float64 0.559 0.01037 0.1562 ... 0.08778 0.3272 0.8661
But I'd like to .sel along slices of the lat/lon. This currently isn't possible, as I get the error:
ds.sel(lat=slice(32,36), lon=slice(-118, -115))
ValueError Traceback (most recent call last)
<ipython-input-20-28c79202d5f3> in <module>
----> 1 ds.sel(lat=slice(32,36), lon=slice(-118, -115))
~/.local/lib/python3.8/site-packages/xarray/core/dataset.py in sel(self, indexers, method, tolerance, drop, **indexers_kwargs)
2363 """
2364 indexers = either_dict_or_kwargs(indexers, indexers_kwargs, "sel")
-> 2365 pos_indexers, new_indexes = remap_label_indexers(
2366 self, indexers=indexers, method=method, tolerance=tolerance
2367 )
~/.local/lib/python3.8/site-packages/xarray/core/coordinates.py in remap_label_indexers(obj, indexers, method, tolerance, **indexers_kwargs)
419 }
420
--> 421 pos_indexers, new_indexes = indexing.remap_label_indexers(
422 obj, v_indexers, method=method, tolerance=tolerance
423 )
~/.local/lib/python3.8/site-packages/xarray/core/indexing.py in remap_label_indexers(data_obj, indexers, method, tolerance)
256 new_indexes = {}
257
--> 258 dim_indexers = get_dim_indexers(data_obj, indexers)
259 for dim, label in dim_indexers.items():
260 try:
~/.local/lib/python3.8/site-packages/xarray/core/indexing.py in get_dim_indexers(data_obj, indexers)
222 ]
223 if invalid:
--> 224 raise ValueError(f"dimensions or multi-index levels {invalid!r} do not exist")
225
226 level_indexers = defaultdict(dict)
ValueError: dimensions or multi-index levels ['lat', 'lon'] do not exist
So my question is this: how can I change the dimensions of data to be lat/lon (each of shape (10, 10)) instead of (x: 10, y: 10)? Is this even possible?
Code to reproduce the example dataset:
import numpy as np
import xarray as xr
# Create sample data
data = np.random.rand(10,10)
x = y = np.arange(10)
# Set up dataset
ds = xr.Dataset(
    data_vars=dict(
        data=(["x", "y"], data)
    ),
    coords={
        "x": x,
        "y": y
    }
)
# Create example lat/lon and assign to dataset
lon, lat = np.meshgrid(np.linspace(-120, -111, 10), np.linspace(30, 39, 10))
ds = ds.assign_coords({
    "lat": (["x", "y"], lat),
    "lon": (["x", "y"], lon)
})
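One possible workaround, sketched here and not taken from the original post: since lat and lon are 2D non-dimension coordinates, .sel() cannot slice them, but .where() with drop=True can mask on their values while keeping the x/y dimensions:
subset = ds.where(
    (ds.lat >= 32) & (ds.lat <= 36) & (ds.lon >= -118) & (ds.lon <= -115),
    drop=True,
)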
I have a list of coordinates that have areas mapped out as follows:
import pandas as pd
df=pd.DataFrame({'user_id':[55,55,356,356,356,356,632,752,938,963,963,1226,2663,2663,2663,2663,2663,3183,3197,3344,3387,3387,3387,3387,3396,3515,3536,3570,3819,3883,3883,3883,3883,3883,3883,3883,3883,3883,3883,3883,3883,4584,4594,4713,4931,4931,5026,5487,5487,5575,5575,5575,5602,5639,5639,5639,5639,5783,5783,5783,5783,5783,5801,6373,6718,6886,6886,7055,7055,7608,7608,7777,8186,8186,8307,8712,9271,9896,9991,9991,9991,],
'latitude':[13.2633943,13.2633964,12.809677124,12.8099212646,12.8100585938,12.810065981,12.9440132,12.2958104,12.5265661,13.0767648,13.0853577,12.6301221,12.8558120728,12.8558349609,12.8558654785,12.8558807373,12.8558959961,12.9141417,13.0696411133,13.0708333,10.7904833,10.7904833,10.7904833,12.884091,13.0694428,13.204637,12.6922086,13.0767648,13.3489958,12.8653798,12.8654014,12.8654124,12.8654448,12.8654521,12.8654658,12.8654733,12.8654815,12.8654844,12.8655367,12.8655376,12.865576,12.4025539,13.1986348,12.9548317,11.664325,11.6690603,13.0656551,13.1137554,13.1137978,12.770418,12.9141417,12.9141417,15.3530727,12.8285405054,12.8285925,12.8288406,12.829668,12.2958104,12.5583190918,12.7367172241,12.7399597168,12.7422103882,12.8631981,13.3378762,12.5638375681,13.1961683,13.1993678,12.1210997,12.5265661,13.1332778931,13.13331604,12.1210997,13.0649372,13.0658797,12.6955714,12.8213806152,13.0641708374,13.2013835,13.1154662,13.1957473755,13.2329025269,],
'longitude':[75.4341412,75.4341377,77.6955155017,77.6952344177,77.6952628334,77.6952629697,75.7926285,76.6393805,78.2149575,77.6397007,77.6445166,77.1145378,77.7985897361,77.7985953164,77.798622112,77.7985610742,77.7986275271,74.8559568,77.6520116309,77.6519444,78.7046725,78.7046725,78.7046725,74.8372421,77.6523596,77.6506622,78.6181131,77.6397007,74.7855559,77.7972191,77.7971733,77.7971429,77.7971621,77.7970823,77.7970327,77.7970371,77.7972272,77.7970335,77.7969649,77.796956,77.7971244,75.9811564,77.7065928,77.4739615,78.1460142,78.139311,77.4380296,77.5732437,77.573201,74.8609332,74.8559568,74.8559568,75.1386825,77.6891233027,77.6899376,77.6892531,77.6902955,76.6393805,77.7842363745,77.7841222429,77.7837989946,77.7830295359,77.4336428,77.117325,75.5833357573,77.7053231,77.7095658,78.1582143,78.2149575,77.5728687166,77.5729374436,78.1582143,77.7435873,77.7444939,78.0620963,77.6606095672,77.746332751,77.7082838,77.6069977,77.7055573788,77.6956690934,],
})
For these latitude/longitude pairs I am using DBSCAN to cluster them:
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array(df[['latitude', 'longitude']])
kms_per_radian = 6371.0088
epsilon = 1 / kms_per_radian
db = DBSCAN(eps=epsilon, min_samples=5)
model = db.fit(np.radians(X))
cluster_labels = db.labels_
num_clusters = len(set(cluster_labels))
cluster_labels = cluster_labels.astype(float)
cluster_labels[cluster_labels == -1] = np.nan
clusters = pd.Series([X[cluster_labels == n] for n in range(num_clusters)])
labels = pd.DataFrame(db.labels_, columns=['CLUSTER_LABEL'])
dfnew = pd.concat([df, labels], axis=1, sort=False)
How do I get the center point of these clusters and map it back to the dataset, so that when I display it in folium I can place a marker (and start the summary) at each cluster center?
So far I have tried
from shapely.geometry import MultiPoint
from geopy.distance import great_circle

def get_centermost_point(cluster):
    centroid = (MultiPoint(cluster).centroid.x, MultiPoint(cluster).centroid.y)
    centermost_point = min(cluster, key=lambda point: great_circle(point, centroid).m)
    return tuple(centermost_point)

centermost_points = clusters.map(get_centermost_point)
which gives me an IndexError: list index out of range error.
To get the coordinates of each cluster's centroid:
for ea in clusters:
    print(MultiPoint(ea).centroid)
Outcome:
POINT (12.85585784912 77.79859915316)
POINT (12.86547048333333 77.79709629166666)
POINT (13.1982603551 77.70706457576)
POINT EMPTY
To create a geodataframe from the centroids and plot it
(assuming the coordinates are long/lat):
import geopandas as gpd

# To create a geodataframe of the centroids
clusters_centroids = [MultiPoint(ea).centroid for ea in clusters]
crs = {'init': 'epsg:4326'}
cgdf = gpd.GeoDataFrame(clusters, crs=crs, geometry=clusters_centroids)
# Eliminate some empty row(s)
good_cdgf = cgdf[~cgdf['geometry'].is_empty]
# plot to see the centroids
good_cdgf.plot()
The output plot:
To add the center points back into the original dataframe df.
Here I start with checking dfnew which is simply df with added column CLUSTER_LABEL.
print(dfnew)
user_id latitude longitude CLUSTER_LABEL
0 55 13.263394 75.434141 -1
1 55 13.263396 75.434138 -1
2 356 12.809677 77.695516 -1
3 356 12.809921 77.695234 -1
4 356 12.810059 77.695263 -1
.. ... ... ... ...
76 9271 13.064171 77.746333 -1
77 9896 13.201384 77.708284 2
78 9991 13.115466 77.606998 -1
79 9991 13.195747 77.705557 2
80 9991 13.232903 77.695669 -1
[81 rows x 4 columns]
The column CLUSTER_LABEL will be used to join and get values from cgdf dataframe.
Add a new CLUSTER_LABEL column with the proper cluster label values to cgdf:
cgdf["CLUSTER_LABEL"] = [0,1,2, -1]
Drop column 0 of cgdf
cgdf.drop(columns=[0], axis=1, inplace=True)
Check current cgdf
print(cgdf)
geometry CLUSTER_LABEL
0 POINT (12.856 77.799) 0
1 POINT (12.865 77.797) 1
2 POINT (13.198 77.707) 2
3 POINT EMPTY -1
Merge two dataframes into new dataframe dfnew2.
dfnew2 = dfnew.merge(cgdf, on='CLUSTER_LABEL')
Check current status of dfnew2, it should look like this:
user_id latitude longitude CLUSTER_LABEL geometry
0 55 13.263394 75.434141 -1 POINT EMPTY
1 55 13.263396 75.434138 -1 POINT EMPTY
2 356 12.809677 77.695516 -1 POINT EMPTY
3 356 12.809921 77.695234 -1 POINT EMPTY
4 356 12.810059 77.695263 -1 POINT EMPTY
.. ... ... ... ... ...
76 4594 13.198635 77.706593 2 POINT (13.198 77.707)
77 6886 13.196168 77.705323 2 POINT (13.198 77.707)
78 6886 13.199368 77.709566 2 POINT (13.198 77.707)
79 9896 13.201384 77.708284 2 POINT (13.198 77.707)
80 9991 13.195747 77.705557 2 POINT (13.198 77.707)
[81 rows x 5 columns]
'dfnew2' should be equivalent to the original dataframe with 2 additional columns, 'CLUSTER_LABEL' and 'geometry' (the cluster's center point).
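A simpler alternative sketch (not from the answer above) that skips geopandas entirely: take the mean latitude/longitude per cluster with groupby and merge it back. The center_lat/center_lon names are just illustrative, and a plain mean is only an approximation of a geographic centroid for small clusters:
centroids = (
    dfnew[dfnew['CLUSTER_LABEL'] != -1]
    .groupby('CLUSTER_LABEL')[['latitude', 'longitude']]
    .mean()
    .rename(columns={'latitude': 'center_lat', 'longitude': 'center_lon'})
    .reset_index()
)
dfnew3 = dfnew.merge(centroids, on='CLUSTER_LABEL', how='left')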
import pandas as pd

try:
    from sklearn.tree import DecisionTreeClassifier
except:
    pass
from sklearn.cluster import KMeans

def kmeans_centers(list_of_lats_lngs):  # input: a list of [lat, lng] lists
    try:
        data = pd.DataFrame(list_of_lats_lngs, columns=['lat', 'lng'])
        data['eventType'] = "test"
        data.dropna(axis=0, how='any', subset=['lat', 'lng'], inplace=True)
        X = data.loc[:, ['eventType', 'lat', 'lng']]
        K_clusters = range(1, 10)
        kmeans = [KMeans(n_clusters=i) for i in K_clusters]
        Y_axis = data[['lat']]
        X_axis = data[['lng']]
        kmeans = KMeans(n_clusters=3, init='k-means++')
        kmeans.fit(X[X.columns[1:3]])
        X['cluster_label'] = kmeans.fit_predict(X[X.columns[1:3]])
        centers = kmeans.cluster_centers_  # coordinates of cluster centers
        # labels = kmeans.predict(X[X.columns[1:3]])  # labels of each point
        return centers
    except Exception as e:
        print("kmeans - clustering exception", e)
        return None
Ready to use. Example input:
[[12.02, 12.34], [12.12, 12.04], [12.092, 12.74], [22.02, 13.34]]
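A quick usage sketch with that sample input:
centers = kmeans_centers([[12.02, 12.34], [12.12, 12.04], [12.092, 12.74], [22.02, 13.34]])
print(centers)  # array of 3 [lat, lng] cluster centers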
Pandas has a very handy function, DataFrame.corr(), for pairwise correlation of columns.
That means it is possible to compute the correlation between every pair of columns in one call. For instance:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(100, 10)))
0 1 2 3 4 5 6 7 8 9
0 9 17 55 32 7 97 61 47 48 46
1 8 83 87 56 17 96 81 8 87 0
2 60 29 8 68 56 63 81 5 24 52
3 42 76 6 75 7 59 19 17 3 63
...
Now it is possible to test correlation between all 10 columns with df.corr(method='pearson'):
0 1 2 3 4 5 6 7 8 9
0 1.000000 0.082789 -0.094096 -0.086091 0.163091 0.013210 0.167204 -0.002514 0.097481 0.091020
1 0.082789 1.000000 0.027158 -0.080073 0.056364 -0.050978 -0.018428 -0.014099 -0.135125 -0.043797
2 -0.094096 0.027158 1.000000 -0.102975 0.101597 -0.036270 0.202929 0.085181 0.093723 -0.055824
3 -0.086091 -0.080073 -0.102975 1.000000 -0.149465 0.033130 -0.020929 0.183301 -0.003853 -0.062889
4 0.163091 0.056364 0.101597 -0.149465 1.000000 -0.007567 -0.017212 -0.086300 0.177247 -0.008612
5 0.013210 -0.050978 -0.036270 0.033130 -0.007567 1.000000 -0.080148 -0.080915 -0.004612 0.243713
6 0.167204 -0.018428 0.202929 -0.020929 -0.017212 -0.080148 1.000000 0.135348 0.070330 0.008170
7 -0.002514 -0.014099 0.085181 0.183301 -0.086300 -0.080915 0.135348 1.000000 -0.114413 -0.111642
8 0.097481 -0.135125 0.093723 -0.003853 0.177247 -0.004612 0.070330 -0.114413 1.000000 -0.153564
9 0.091020 -0.043797 -0.055824 -0.062889 -0.008612 0.243713 0.008170 -0.111642 -0.153564 1.000000
Is there a simple way to also get the corresponding p-values (ideally in pandas), as it is returned e.g. by scipy's kendalltau()?
Why not use the "method" argument of pandas.DataFrame.corr()? It accepts:
pearson : standard correlation coefficient.
kendall : Kendall Tau correlation coefficient.
spearman : Spearman rank correlation.
callable: callable with input two 1d ndarrays and returning a float.
from scipy.stats import kendalltau, pearsonr, spearmanr
def kendall_pval(x,y):
return kendalltau(x,y)[1]
def pearsonr_pval(x,y):
return pearsonr(x,y)[1]
def spearmanr_pval(x,y):
return spearmanr(x,y)[1]
and then
corr = df.corr(method=pearsonr_pval)
Probably just loop. It's basically what pandas does in the source code to generate the correlation matrix anyway:
import pandas as pd
import numpy as np
from scipy import stats
df_corr = pd.DataFrame() # Correlation matrix
df_p = pd.DataFrame() # Matrix of p-values
for x in df.columns:
    for y in df.columns:
        corr = stats.pearsonr(df[x], df[y])
        df_corr.loc[x, y] = corr[0]
        df_p.loc[x, y] = corr[1]
If you want to leverage the fact that this is symmetric, so you only need to calculate this for roughly half of them, then do:
mat = df.values.T
K = len(df.columns)
correl = np.empty((K,K), dtype=float)
p_vals = np.empty((K,K), dtype=float)
for i, ac in enumerate(mat):
    for j, bc in enumerate(mat):
        if i > j:
            continue
        else:
            corr = stats.pearsonr(ac, bc)
            # corr = stats.kendalltau(ac, bc)
            correl[i, j] = corr[0]
            correl[j, i] = corr[0]
            p_vals[i, j] = corr[1]
            p_vals[j, i] = corr[1]
df_p = pd.DataFrame(p_vals)
df_corr = pd.DataFrame(correl)
#pd.concat([df_corr, df_p], keys=['corr', 'p_val'])
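If you want the original column labels on these results, a small follow-up sketch (assuming df is the frame from the question):
df_corr = pd.DataFrame(correl, index=df.columns, columns=df.columns)
df_p = pd.DataFrame(p_vals, index=df.columns, columns=df.columns)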
This will work:
from scipy.stats import pearsonr

column_values = [column for column in df.columns.tolist()]
df['Correlation_coefficent'], df['P-value'] = zip(*df.T.apply(lambda x: pearsonr(x[column_values], x[column_values])))
df_result = df[['Correlation_coefficent', 'P-value']]
Does this work for you?
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import pearsonr

# call the correlation function, you could round the values if needed
df_c = df.corr().round(1)
# get the p values
pval = df.corr(method=lambda x, y: pearsonr(x, y)[1]) - np.eye(*df_c.shape)
# set the p values, *** for less than 0.001, ** for less than 0.01, * for less than 0.05
p = pval.applymap(lambda x: ''.join(['*' for t in [0.001, 0.01, 0.05] if x <= t]))
# df_c2 below will give you the dataframe with correlation coefficients and p values
df_c2 = df_c.astype(str) + p
# you could also plot the correlation matrix using sns.heatmap if you want
# plot the triangle
matrix = np.triu(df_c)
# convert to array for the heatmap
df_c3 = df_c2.to_numpy()
# plot the heatmap
plt.figure(figsize=(13, 8))
sns.heatmap(df_c, annot=df_c3, fmt='', vmin=-1, vmax=1, center=0, cmap='coolwarm', mask=matrix)