KDTree duplicating rows - python

I have two dataframes.
The first dataframe (map) consists of two columns, "X" and "Y", and has 83150 rows.
The second dataframe (coords) consists of two columns, "X Rotate" and "Y Rotate", and has 2702 rows.
The objective is to find the nearest neighbor for each (X,Y) coordinate within map to the (X Rotate, Y Rotate) coordinates within coords.
To do this, I duplicate each row within coords 31 times (ceil(83150 / 2702) = 31), so coords grows to 83762 rows. The idea is that each (X, Y) coordinate will find a nearest neighbor among the (X Rotate, Y Rotate) rows, and 612 of the duplicated coords rows will be left without a match.
This is the function to make this happen:
import math
import numpy as np
import pandas as pd
from scipy import spatial

def nearest_neighbors(df, map):
    # df holds the (X Rotate, Y Rotate) coords; map holds the (X, Y) points.
    num_pts = math.ceil(map.shape[0] / df.shape[0])
    map = map[["X", "Y"]].to_numpy()
    duplicate_cords_df = pd.DataFrame(np.repeat(df.values, num_pts, axis=0), columns=df.columns)
    duplicate_cords_sub = duplicate_cords_df[["X Rotate", "Y Rotate"]].to_numpy()
    # build the KD-tree once over the duplicated coords
    map_tree = spatial.cKDTree(duplicate_cords_sub)
    cols = ["Map X", "Map Y", "X Rotate", "Y Rotate", "Distance"]
    list_of_dicts = []
    for row in map:
        distance, index = map_tree.query(row)
        map_x = row[0]
        map_y = row[1]
        coords_x = duplicate_cords_sub[index][0]
        coords_y = duplicate_cords_sub[index][1]
        results = [map_x, map_y, coords_x, coords_y, distance]
        list_of_dicts.append(dict(zip(cols, results)))
    results_df = pd.DataFrame(list_of_dicts)
    return results_df
However, when I check the count for the number of duplicates in results_df, I notice that each (X Rotate, Y Rotate) coordinate is being used a varying number of times.
overall_df_dup = results_df.groupby(['X Rotate', 'Y Rotate']).size().reset_index(name='count')
print(overall_df_dup)
X Rotate Y Rotate count
0 -74.25 0.00 16
1 -72.48 -12.37 28
2 -72.48 -8.84 37
3 -72.48 -5.30 43
4 -72.48 -1.77 39
... ... ... ...
2697 70.71 14.14 62
2698 72.48 -8.84 45
2699 72.48 -1.77 55
2700 72.48 1.77 47
2701 72.48 5.30 48
I checked the duplicate counts in the dataframe fed to the KDTree and they were correct:
coords_dup = duplicate_cords_df.groupby(['X Rotate', 'Y Rotate']).size().reset_index(name='count')
print(coords_dup)
X Rotate Y Rotate count
0 -74.25 -0.00 31
1 -72.48 -12.37 31
2 -72.48 -8.84 31
3 -72.48 -5.30 31
4 -72.48 -1.77 31
... ... ... ...
2697 70.71 14.14 31
2698 72.48 -8.84 31
2699 72.48 -1.77 31
2700 72.48 1.77 31
2701 72.48 5.30 31
How can the resulting df contain more duplicates of some coordinates than exist in the dataframe fed into the cKDTree?
Bonus question: Is it possible to have each (X Rotate, Y Rotate) coordinate mapped 30 times and only some of them mapped 31 times? Ideally, I want every (X Rotate, Y Rotate) coordinate to be mapped exactly 30 times regardless.

This is probably not the answer you intended, but it may help, using cKDTree.
Create a minimal reproducible example:
import pandas as pd
import numpy as np
from scipy.spatial import cKDTree
gen_coords = lambda s: np.round(np.random.randint(-100, 100, s) + np.random.random(s), 2)
df_map = pd.DataFrame(gen_coords((83150, 2)), columns=['X', 'Y'])
df_coords = pd.DataFrame(gen_coords((2702, 2)), columns=['X Rotate', 'Y Rotate'])
Map coordinates:
df_coords['IDX'] = cKDTree(df_map).query(df_coords, k=30)[1].tolist()
df_coords = df_coords.explode('IDX')
df_coords[['X', 'Y']] = df_map.loc[df_coords['IDX'].tolist()].values
df_coords = df_coords.drop(columns='IDX')
Output result:
>>> df_coords
X Rotate Y Rotate X Y
0 99.00 57.35 99.18 57.13
0 99.00 57.35 98.54 57.53
0 99.00 57.35 99.14 58.20
0 99.00 57.35 99.88 57.36
0 99.00 57.35 98.03 56.94
... ... ... ... ...
2701 92.75 -8.69 91.40 -9.74
2701 92.75 -8.69 91.75 -7.29
2701 92.75 -8.69 93.41 -7.09
2701 92.75 -8.69 94.48 -8.78
2701 92.75 -8.69 93.29 -10.36
[81060 rows x 4 columns]
>>> df_coords.value_counts(['X Rotate', 'Y Rotate'])
X Rotate Y Rotate
-99.71 -20.20 30
35.72 85.56 30
34.64 76.37 30
34.76 8.32 30
34.90 -4.75 30
..
-32.69 -44.76 30
-32.66 72.96 30
-32.63 -40.65 30
-32.61 34.91 30
99.89 98.02 30
Length: 2702, dtype: int64
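A note on the bonus question: with this approach every (X Rotate, Y Rotate) pair appears exactly 30 times by construction (2702 × 30 = 81060 matched rows, out of 83150 map points), at the cost that a given map point may be matched to several coords points, or to none. A quick sanity check (a sketch, assuming df_coords is the exploded frame above):
assert (df_coords.value_counts(['X Rotate', 'Y Rotate']) == 30).all()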


Do an operation only if values from same column of two dataframes are the same

I have a func_df with 4 functions:
x y1 y2 y3 y4
0 -20.0 -0.839071 10.0 0.816164 -8795.000
1 -19.9 -0.865213 9.9 0.994372 -8667.619
2 -19.8 -0.889191 9.8 1.162644 -8541.472
3 -19.7 -0.910947 9.7 1.319299 -8416.553
4 -19.6 -0.930426 9.6 1.462772 -8292.856
.. ... ... ... ... ...
395 19.5 -0.947580 9.5 1.591630 6659.375
396 19.6 -0.930426 9.6 1.462772 6766.216
397 19.7 -0.910947 9.7 1.319299 6874.193
398 19.8 -0.889191 9.8 1.162644 6983.312
399 19.9 -0.865213 9.9 0.994372 7093.579
And a test_df with scatter points:
x y
0 -6.2 0.360801
1 6.4 -3.655422
2 -17.6 -6065.659700
3 -1.5 -3.247304
4 -17.7 -0.785430
.. ... ...
95 1.6 3.722551
96 16.3 -1.067487
97 -13.3 1.857445
98 -3.8 -0.008831
99 -13.2 1.294064
I want to find the deviation (distance) between all the scatter points and the 4 functions when the x-value is the same in both dataframes.
Some scatter points share the same x-value but have different y-values.
Edit: A quick example:
Starting with column y1 from func_df:
1st value is x = -20.0 , y1 = -0.839071.
I want the program to search if there a row in which x = -20.0 in test_df and if so, then find the difference between the y-value of that row and the y-value of func_df, which is -0.839071.
Imagine that in test_df there is a row with x = -20, y = -1. Then what I want is abs(-1 - (-0.839071)).
I used abs() because the distance has to be a positive value.
This was for the row 0 of column y1. I need it for all rows and also for y2, y3 and y4 of func_df.
I tried something like this:
if test_df.x.equals(func_df.x):
    result_df = func_df.iloc[:, 1:5].apply(lambda cell: cell - test_df.y[cell.index])
But honestly this was a shot in the dark; I have no idea what I'm doing.
create sample data:
func_df=pd.DataFrame(data={'x':[-20.9,-20.8,-20.7,-20.6],'y1':[-0.12,-0.021,-0.04,-0.91],
'y2':[10.0,9.9,9.8,9.7],'y3':[0.99437,1.162644,1.319299,1.462772],
'y4':[-8667.619,-8541.472,-8416.553,-8292.856]})
print(func_df)
'''
x y1 y2 y3 y4
0 -20.9 -0.120 10.0 0.994370 -8667.619
1 -20.8 -0.021 9.9 1.162644 -8541.472
2 -20.7 -0.040 9.8 1.319299 -8416.553
3 -20.6 -0.910 9.7 1.462772 -8292.856
'''
test_df=pd.DataFrame(data={'x':[-20.9,-15.2],'y':[0.360801,-3.655422]})
print(test_df)
'''
x y
0 -20.9 0.360801
1 -15.2 -3.655422
'''
As you can see, the first rows match in both dfs. Now let's combine these two dataframes, using a merge to keep only the matching rows.
final=func_df.merge(test_df,on='x')
print(final)
'''
x y1 y2 y3 y4 y
0 -20.9 -0.12 10.0 0.99437 -8667.619 0.360801
'''
#if you want to see all values use final=func_df.merge(test_df,how='left',on='x')
Now that we have the matching rows and the y value in func_df we can do the calculation.
loop_cols = [*final.columns[1:5]]  # ['y1', 'y2', 'y3', 'y4']
for i in loop_cols:
    final['distance_{}'.format(i)] = abs(final['y'] - final[i])
print(final)
'''
x y1 y2 y3 y4 y distance_y1 distance_y2 distance_y3 distance_y4
0 -20.9 -0.12 10.0 0.99437 -8667.619 0.360801 0.480801 9.639199 0.633569 8667.979801000001
'''
If the values of 'x' in test_df are unique, you could merge the two dataframes on 'x':
merged_df = pandas.merge(test_df, func_df, on='x')
abs_delta_y1 = (merged_df['y'] - merged_df['y1']).abs()
etc...
I agree with Ray Pelletier - unfortunately I do not have enough credit to comment.
If you merge the two data frames on x, then you can create a new dataframe, where you, for each row in the merged dataframe, can calculate the difference between y and y1, y and y2, and so on.
If you read up on the merge function, its how parameter can be set to 'inner' (which is also the default); the merged dataframe will then only contain values of x that are present in both dataframes.
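A minimal sketch of that idea (my addition), using the func_df and test_df samples from the first answer and looping over the four y columns:
merged_df = func_df.merge(test_df, on='x', how='inner')
for col in ['y1', 'y2', 'y3', 'y4']:
    # absolute difference between the scatter y and each function's y
    merged_df['distance_{}'.format(col)] = (merged_df['y'] - merged_df[col]).abs()
print(merged_df)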

Sum of the column values if the rows meet the conditions

I am trying to calculate the sum of sales for stores in the same neighborhood based on their geographic coordinates. I have sample data:
data = pd.DataFrame({'ID':['1','2','3','4'],'SALE':[100,120,110,95],'X':[23,22,21,24],'Y':[44,45,41,46],'X_MIN':[22,21,20,23],'Y_MIN':[43,44,40,45],'X_MAX':[24,23,22,25],'Y_MAX':[45,46,42,47]})
ID  SALE   X   Y  X_MIN  Y_MIN  X_MAX  Y_MAX
 1   100  23  44     22     43     24     45
 2   120  22  45     21     44     23     46
 3   110  21  41     20     40     22     42
 4    95  24  46     23     45     25     47
X and Y are the store's coordinates; the MIN and MAX columns define the area it covers. For each row, I want to sum the sales of all stores whose (X, Y) falls within that store's boundaries. I expect results like the table below: the SUM for ID 1 is 220 because the coordinates of both ID 1 and ID 2 fall within ID 1's MIN/MAX limits, while for ID 4 only that store itself lies within its own bounds, so its sum is 95.
final={'ID':['1','2','3','4'],'SUM':[220,220,110,95]}
ID  SUM
 1  220
 2  220
 3  110
 4   95
What I've tried:
data['SUM'] = data.apply(lambda x: data['SALE'].sum(data[(data['X'] >= x['X_MIN'])&(data['X'] <= x['X_MAX'])&(data['Y'] >= x['Y_MIN'])&(data['Y'] <= x['Y_MAX'])]),axis=1)
Unfortunately the code does not work and I am getting the following error:
TypeError: unhashable type: 'DataFrame'
I am asking for help in solving this problem.
If you put the summation at the end, your solution works:
data['SUM'] = data.apply(
    lambda x: (data['SALE'][(data['X'] >= x['X_MIN']) & (data['X'] <= x['X_MAX'])
                            & (data['Y'] >= x['Y_MIN']) & (data['Y'] <= x['Y_MAX'])]).sum(),
    axis=1)
###output of data['SUM']:
###0 220
###1 220
###2 110
###3 95
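For readability, the same row-wise logic can be written as a named helper (a sketch of the approach above, assuming data is the DataFrame built from the sample dict):
def sales_within_bounds(row):
    # Stores whose (X, Y) lies inside this row's bounding box.
    inside = ((data['X'] >= row['X_MIN']) & (data['X'] <= row['X_MAX'])
              & (data['Y'] >= row['Y_MIN']) & (data['Y'] <= row['Y_MAX']))
    return data.loc[inside, 'SALE'].sum()

data['SUM'] = data.apply(sales_within_bounds, axis=1)
print(data[['ID', 'SUM']])  # ID 1 -> 220, 2 -> 220, 3 -> 110, 4 -> 95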

xarray set new 2D coordinate as dimension

I have an xarray dataset of sea surface temperature values on an x/y grid. x and y are 1D vector coordinates, so it looks like this minimal example:
<xarray.Dataset>
Dimensions: (x: 10, y: 10)
Coordinates:
* x (x) int64 0 1 2 3 4 5 6 7 8 9
* y (y) int64 0 1 2 3 4 5 6 7 8 9
Data variables:
data (x, y) float64 0.559 0.01037 0.1562 ... 0.08778 0.3272 0.8661
I am able to compute the lat/lon from this x/y grid, and the output is 2 2D arrays. I can add them as coordinates with ds.assign_coords:
<xarray.Dataset>
Dimensions: (x: 10, y: 10)
Coordinates:
* x (x) int64 0 1 2 3 4 5 6 7 8 9
* y (y) int64 0 1 2 3 4 5 6 7 8 9
lat (x, y) float64 30.0 30.0 30.0 30.0 30.0 ... 39.0 39.0 39.0 39.0
lon (x, y) float64 -120.0 -119.0 -118.0 -117.0 ... -113.0 -112.0 -111.0
Data variables:
data (x, y) float64 0.559 0.01037 0.1562 ... 0.08778 0.3272 0.8661
But I'd like to .sel along slices of the lat/lon. This currently isn't possible, as I get the error:
ds.sel(lat=slice(32,36), lon=slice(-118, -115))
ValueError Traceback (most recent call last)
<ipython-input-20-28c79202d5f3> in <module>
----> 1 ds.sel(lat=slice(32,36), lon=slice(-118, -115))
~/.local/lib/python3.8/site-packages/xarray/core/dataset.py in sel(self, indexers, method, tolerance, drop, **indexers_kwargs)
2363 """
2364 indexers = either_dict_or_kwargs(indexers, indexers_kwargs, "sel")
-> 2365 pos_indexers, new_indexes = remap_label_indexers(
2366 self, indexers=indexers, method=method, tolerance=tolerance
2367 )
~/.local/lib/python3.8/site-packages/xarray/core/coordinates.py in remap_label_indexers(obj, indexers, method, tolerance, **indexers_kwargs)
419 }
420
--> 421 pos_indexers, new_indexes = indexing.remap_label_indexers(
422 obj, v_indexers, method=method, tolerance=tolerance
423 )
~/.local/lib/python3.8/site-packages/xarray/core/indexing.py in remap_label_indexers(data_obj, indexers, method, tolerance)
256 new_indexes = {}
257
--> 258 dim_indexers = get_dim_indexers(data_obj, indexers)
259 for dim, label in dim_indexers.items():
260 try:
~/.local/lib/python3.8/site-packages/xarray/core/indexing.py in get_dim_indexers(data_obj, indexers)
222 ]
223 if invalid:
--> 224 raise ValueError(f"dimensions or multi-index levels {invalid!r} do not exist")
225
226 level_indexers = defaultdict(dict)
ValueError: dimensions or multi-index levels ['lat', 'lon'] do not exist
So my question is this: How can I change the dimensions of data to be (lon: (10,10), lat: (10,10)) instead of (x: 10, y: 10)? Is this even possible?
Code to reproduce the example dataset:
import numpy as np
import xarray as xr

# Create sample data
data = np.random.rand(10, 10)
x = y = np.arange(10)

# Set up dataset
ds = xr.Dataset(
    data_vars=dict(
        data=(["x", "y"], data)
    ),
    coords={
        "x": x,
        "y": y
    }
)

# Create example lat/lon and assign to dataset
lon, lat = np.meshgrid(np.linspace(-120, -111, 10), np.linspace(30, 39, 10))
ds = ds.assign_coords({
    "lat": (["x", "y"], lat),
    "lon": (["x", "y"], lon)
})
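One workaround sketch (my addition; not a verified answer): lat and lon here are two-dimensional, non-index coordinates, so .sel cannot slice on them directly, but a boolean mask with .where can pull out the region:
subset = ds.where(
    (ds.lat >= 32) & (ds.lat <= 36) & (ds.lon >= -118) & (ds.lon <= -115),
    drop=True,
)
With drop=True, x/y positions that fall entirely outside the box are dropped; points that remain but do not satisfy the mask become NaN.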

Getting the center point of a cluster for latitude and longitude in Python

I have a list of coordinates that have areas mapped out as follows:
df=pd.DataFrame({'user_id':[55,55,356,356,356,356,632,752,938,963,963,1226,2663,2663,2663,2663,2663,3183,3197,3344,3387,3387,3387,3387,3396,3515,3536,3570,3819,3883,3883,3883,3883,3883,3883,3883,3883,3883,3883,3883,3883,4584,4594,4713,4931,4931,5026,5487,5487,5575,5575,5575,5602,5639,5639,5639,5639,5783,5783,5783,5783,5783,5801,6373,6718,6886,6886,7055,7055,7608,7608,7777,8186,8186,8307,8712,9271,9896,9991,9991,9991,],
'latitude':[13.2633943,13.2633964,12.809677124,12.8099212646,12.8100585938,12.810065981,12.9440132,12.2958104,12.5265661,13.0767648,13.0853577,12.6301221,12.8558120728,12.8558349609,12.8558654785,12.8558807373,12.8558959961,12.9141417,13.0696411133,13.0708333,10.7904833,10.7904833,10.7904833,12.884091,13.0694428,13.204637,12.6922086,13.0767648,13.3489958,12.8653798,12.8654014,12.8654124,12.8654448,12.8654521,12.8654658,12.8654733,12.8654815,12.8654844,12.8655367,12.8655376,12.865576,12.4025539,13.1986348,12.9548317,11.664325,11.6690603,13.0656551,13.1137554,13.1137978,12.770418,12.9141417,12.9141417,15.3530727,12.8285405054,12.8285925,12.8288406,12.829668,12.2958104,12.5583190918,12.7367172241,12.7399597168,12.7422103882,12.8631981,13.3378762,12.5638375681,13.1961683,13.1993678,12.1210997,12.5265661,13.1332778931,13.13331604,12.1210997,13.0649372,13.0658797,12.6955714,12.8213806152,13.0641708374,13.2013835,13.1154662,13.1957473755,13.2329025269,],
'longitude':[75.4341412,75.4341377,77.6955155017,77.6952344177,77.6952628334,77.6952629697,75.7926285,76.6393805,78.2149575,77.6397007,77.6445166,77.1145378,77.7985897361,77.7985953164,77.798622112,77.7985610742,77.7986275271,74.8559568,77.6520116309,77.6519444,78.7046725,78.7046725,78.7046725,74.8372421,77.6523596,77.6506622,78.6181131,77.6397007,74.7855559,77.7972191,77.7971733,77.7971429,77.7971621,77.7970823,77.7970327,77.7970371,77.7972272,77.7970335,77.7969649,77.796956,77.7971244,75.9811564,77.7065928,77.4739615,78.1460142,78.139311,77.4380296,77.5732437,77.573201,74.8609332,74.8559568,74.8559568,75.1386825,77.6891233027,77.6899376,77.6892531,77.6902955,76.6393805,77.7842363745,77.7841222429,77.7837989946,77.7830295359,77.4336428,77.117325,75.5833357573,77.7053231,77.7095658,78.1582143,78.2149575,77.5728687166,77.5729374436,78.1582143,77.7435873,77.7444939,78.0620963,77.6606095672,77.746332751,77.7082838,77.6069977,77.7055573788,77.6956690934,],
})
For the following latitude/longitude pairs, I am using DBSCAN to cluster them:
X=np.array(df[['latitude', 'longitude']])
kms_per_radian = 6371.0088
epsilon = 1 / kms_per_radian
db = DBSCAN(eps=epsilon, min_samples=5)
model=db.fit(np.radians(X))
cluster_labels = db.labels_
num_clusters = len(set(cluster_labels))
cluster_labels = cluster_labels.astype(float)
cluster_labels[cluster_labels == -1] = np.nan
clusters = pd.Series( [X[cluster_labels==n] for n in range(num_clusters)] )
labels = pd.DataFrame(db.labels_,columns=['CLUSTER_LABEL'])
dfnew=pd.concat([df,labels],axis=1,sort=False)
How do I get the center point of these clusters and map it back to the dataset, so that when I display the data in folium a marker (with its summary) starts at each cluster's center?
So far I have tried
def get_centermost_point(cluster):
    centroid = (MultiPoint(cluster).centroid.x, MultiPoint(cluster).centroid.y)
    centermost_point = min(cluster, key=lambda point: great_circle(point, centroid).m)
    return tuple(centermost_point)
centermost_points = clusters.map(get_centermost_point)
which gives me an IndexError: list index out of range.
To get the coordinates of each cluster's centroid:
for ea in clusters:
    print(MultiPoint(ea).centroid)
Outcome:
POINT (12.85585784912 77.79859915316)
POINT (12.86547048333333 77.79709629166666)
POINT (13.1982603551 77.70706457576)
POINT EMPTY
To create a geodataframe from the centroids and plot it (assuming the coordinates are long/lat):
# To create a geodataframe of the centroids
clusters_centroids = [MultiPoint(ea).centroid for ea in clusters]
crs = {'init': 'epsg:4326'}
cgdf = gpd.GeoDataFrame(clusters, crs=crs, geometry=clusters_centroids)
# Eliminate some empty row(s)
good_cdgf = cgdf[ ~cgdf['geometry'].is_empty ]
# plot to see the centroids
good_cdgf.plot()
The output plot (image not reproduced here) shows the centroid points.
To add the center points back into the original dataframe df, I start by checking dfnew, which is simply df with the added column CLUSTER_LABEL.
print(dfnew)
user_id latitude longitude CLUSTER_LABEL
0 55 13.263394 75.434141 -1
1 55 13.263396 75.434138 -1
2 356 12.809677 77.695516 -1
3 356 12.809921 77.695234 -1
4 356 12.810059 77.695263 -1
.. ... ... ... ...
76 9271 13.064171 77.746333 -1
77 9896 13.201384 77.708284 2
78 9991 13.115466 77.606998 -1
79 9991 13.195747 77.705557 2
80 9991 13.232903 77.695669 -1
[81 rows x 4 columns]
The column CLUSTER_LABEL will be used to join with and pull values from the cgdf dataframe.
Add a CLUSTER_LABEL column with the proper cluster label values to cgdf:
cgdf["CLUSTER_LABEL"] = [0,1,2, -1]
Drop column 0 of cgdf
cgdf.drop(columns=[0], axis=1, inplace=True)
Check current cgdf
print(cgdf)
geometry CLUSTER_LABEL
0 POINT (12.856 77.799) 0
1 POINT (12.865 77.797) 1
2 POINT (13.198 77.707) 2
3 POINT EMPTY -1
Merge two dataframes into new dataframe dfnew2.
dfnew2 = dfnew.merge(cgdf, on='CLUSTER_LABEL')
Check current status of dfnew2, it should look like this:
user_id latitude longitude CLUSTER_LABEL geometry
0 55 13.263394 75.434141 -1 POINT EMPTY
1 55 13.263396 75.434138 -1 POINT EMPTY
2 356 12.809677 77.695516 -1 POINT EMPTY
3 356 12.809921 77.695234 -1 POINT EMPTY
4 356 12.810059 77.695263 -1 POINT EMPTY
.. ... ... ... ... ...
76 4594 13.198635 77.706593 2 POINT (13.198 77.707)
77 6886 13.196168 77.705323 2 POINT (13.198 77.707)
78 6886 13.199368 77.709566 2 POINT (13.198 77.707)
79 9896 13.201384 77.708284 2 POINT (13.198 77.707)
80 9991 13.195747 77.705557 2 POINT (13.198 77.707)
[81 rows x 5 columns]
'dfnew2' should be equivalent to the original dataframe with two additional columns, 'CLUSTER_LABEL' and 'geometry' (the cluster's center point).
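To finish the round trip in folium, here is a minimal sketch (my addition, not part of the answer above). Because the MultiPoint centroids were built from (latitude, longitude) pairs, geometry.x holds latitude and geometry.y holds longitude; the map centre below is an arbitrary point near the data:
import folium

m = folium.Map(location=[12.9, 77.6], zoom_start=9)
centers = dfnew2[dfnew2['CLUSTER_LABEL'] != -1].drop_duplicates('CLUSTER_LABEL')
for _, r in centers.iterrows():
    # geometry.x is latitude and geometry.y is longitude, given how the
    # MultiPoint objects were constructed from (lat, lon) pairs above.
    folium.Marker(location=[r['geometry'].x, r['geometry'].y],
                  popup='Cluster {}'.format(int(r['CLUSTER_LABEL']))).add_to(m)
m.save('cluster_centers.html')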
import pandas as pd
from sklearn.cluster import KMeans

def kmeans_centers(list_of_lats_lngs):  # input: a list of [lat, lng] lists
    try:
        data = pd.DataFrame(list_of_lats_lngs, columns=['lat', 'lng'])
        data['eventType'] = "test"
        data.dropna(axis=0, how='any', subset=['lat', 'lng'], inplace=True)
        X = data.loc[:, ['eventType', 'lat', 'lng']]
        kmeans = KMeans(n_clusters=3, init='k-means++')
        X['cluster_label'] = kmeans.fit_predict(X[X.columns[1:3]])
        centers = kmeans.cluster_centers_  # coordinates of cluster centers
        # labels = kmeans.predict(X[X.columns[1:3]])  # labels of each point
        return centers
    except Exception as e:
        print("kmeans - Clustering exception", e)
        return None
Ready to use. Example input:
[[12.02, 12.34], [12.12, 12.04], [12.092, 12.74], [22.02, 13.34]]
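A usage sketch with that input (my addition; the exact centre values depend on the random KMeans initialisation):
centers = kmeans_centers([[12.02, 12.34], [12.12, 12.04], [12.092, 12.74], [22.02, 13.34]])
print(centers)  # a 3 x 2 array of [lat, lng] cluster centres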

How to calculate p-values for pairwise correlation of columns in Pandas?

Pandas has a very handy function for pairwise correlation of columns, DataFrame.corr().
That makes it possible to compare correlations between any number of columns. For instance:
df = pd.DataFrame(np.random.randint(0,100,size=(100, 10)))
0 1 2 3 4 5 6 7 8 9
0 9 17 55 32 7 97 61 47 48 46
1 8 83 87 56 17 96 81 8 87 0
2 60 29 8 68 56 63 81 5 24 52
3 42 76 6 75 7 59 19 17 3 63
...
Now it is possible to test correlation between all 10 columns with df.corr(method='pearson'):
0 1 2 3 4 5 6 7 8 9
0 1.000000 0.082789 -0.094096 -0.086091 0.163091 0.013210 0.167204 -0.002514 0.097481 0.091020
1 0.082789 1.000000 0.027158 -0.080073 0.056364 -0.050978 -0.018428 -0.014099 -0.135125 -0.043797
2 -0.094096 0.027158 1.000000 -0.102975 0.101597 -0.036270 0.202929 0.085181 0.093723 -0.055824
3 -0.086091 -0.080073 -0.102975 1.000000 -0.149465 0.033130 -0.020929 0.183301 -0.003853 -0.062889
4 0.163091 0.056364 0.101597 -0.149465 1.000000 -0.007567 -0.017212 -0.086300 0.177247 -0.008612
5 0.013210 -0.050978 -0.036270 0.033130 -0.007567 1.000000 -0.080148 -0.080915 -0.004612 0.243713
6 0.167204 -0.018428 0.202929 -0.020929 -0.017212 -0.080148 1.000000 0.135348 0.070330 0.008170
7 -0.002514 -0.014099 0.085181 0.183301 -0.086300 -0.080915 0.135348 1.000000 -0.114413 -0.111642
8 0.097481 -0.135125 0.093723 -0.003853 0.177247 -0.004612 0.070330 -0.114413 1.000000 -0.153564
9 0.091020 -0.043797 -0.055824 -0.062889 -0.008612 0.243713 0.008170 -0.111642 -0.153564 1.000000
Is there a simple way to also get the corresponding p-values (ideally in pandas), as it is returned e.g. by scipy's kendalltau()?
Why not use the "method" argument of pandas.DataFrame.corr():
pearson : standard correlation coefficient.
kendall : Kendall Tau correlation coefficient.
spearman : Spearman rank correlation.
callable: callable with input two 1d ndarrays and returning a float.
from scipy.stats import kendalltau, pearsonr, spearmanr

def kendall_pval(x, y):
    return kendalltau(x, y)[1]

def pearsonr_pval(x, y):
    return pearsonr(x, y)[1]

def spearmanr_pval(x, y):
    return spearmanr(x, y)[1]
and then
corr = df.corr(method=pearsonr_pval)
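A quick usage note (my addition): the result is a DataFrame of p-values shaped like the correlation matrix. For callable methods pandas fills the diagonal with 1.0, so the self-pair entries should be ignored:
pvals = df.corr(method=pearsonr_pval)
print(pvals.round(4))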
Probably just loop. It's basically what pandas does in the source code to generate the correlation matrix anyway:
import pandas as pd
import numpy as np
from scipy import stats

df_corr = pd.DataFrame()  # Correlation matrix
df_p = pd.DataFrame()     # Matrix of p-values
for x in df.columns:
    for y in df.columns:
        corr = stats.pearsonr(df[x], df[y])
        df_corr.loc[x, y] = corr[0]
        df_p.loc[x, y] = corr[1]
If you want to leverage the fact that this is symmetric, so you only need to calculate this for roughly half of them, then do:
mat = df.values.T
K = len(df.columns)
correl = np.empty((K, K), dtype=float)
p_vals = np.empty((K, K), dtype=float)

for i, ac in enumerate(mat):
    for j, bc in enumerate(mat):
        if i > j:
            continue
        else:
            corr = stats.pearsonr(ac, bc)
            # corr = stats.kendalltau(ac, bc)
            correl[i, j] = corr[0]
            correl[j, i] = corr[0]
            p_vals[i, j] = corr[1]
            p_vals[j, i] = corr[1]

df_p = pd.DataFrame(p_vals)
df_corr = pd.DataFrame(correl)
# pd.concat([df_corr, df_p], keys=['corr', 'p_val'])
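A small follow-up (my addition, assuming df is the frame from the question): labelling the matrices with the original column names makes them easier to read.
df_p = pd.DataFrame(p_vals, index=df.columns, columns=df.columns)
df_corr = pd.DataFrame(correl, index=df.columns, columns=df.columns)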
This will work:
from scipy.stats import pearsonr
column_values = [column for column in df.columns.tolist() ]
df['Correlation_coefficent'], df['P-value'] = zip(*df.T.apply(lambda x: pearsonr(x[column_values ],x[column_values ])))
df_result = df[['Correlation_coefficent','P-value']]
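Note that the lambda above pairs each row with itself, so it returns a coefficient of 1.0 everywhere. A hedged sketch of a long-format alternative (my addition, not the answerer's code), pairing every column against every other column:
from itertools import combinations
from scipy.stats import pearsonr

rows = []
for a, b in combinations(df.columns, 2):
    r, p = pearsonr(df[a], df[b])
    rows.append({'col_1': a, 'col_2': b, 'r': r, 'p_value': p})
pairwise = pd.DataFrame(rows)
print(pairwise.sort_values('p_value').head())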
Does this work for you?
from scipy.stats import pearsonr
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# call the correlation function; you could round the values if needed
df_c = df.corr().round(1)
# get the p-values
pval = df.corr(method=lambda x, y: pearsonr(x, y)[1]) - np.eye(*df_c.shape)
# mark the p-values: *** for less than 0.001, ** for less than 0.01, * for less than 0.05
p = pval.applymap(lambda x: ''.join(['*' for t in [0.001, 0.01, 0.05] if x <= t]))
# df_c2 below will give you the dataframe with correlation coefficients and p-value stars
df_c2 = df_c.astype(str) + p
# you could also plot the correlation matrix using sns.heatmap if you want
# mask the upper triangle
matrix = np.triu(df_c)
# convert to an array for the heatmap annotations
df_c3 = df_c2.to_numpy()
# plot the heatmap
plt.figure(figsize=(13, 8))
sns.heatmap(df_c, annot=df_c3, fmt='', vmin=-1, vmax=1, center=0, cmap='coolwarm', mask=matrix)
