I'm trying to find the convex hull of a series of points based on two columns of a pandas dataframe.
My current code is:
# Create column of point co-ordinates
df['xy'] = df.apply(lambda x: [x['col_1'], x['col_2']], axis=1)
# Return a numpy array of the point coordinates
point_list = df.xy.values
# pass the list to ConvexHull (imported using: from scipy.spatial import ConvexHull)
hull = ConvexHull(point_list)
I get this error when I run it:
Traceback (most recent call last):
File "<ipython-input-41-517201a29182>", line 1, in <module>
hull = ConvexHull(point_list)
File "qhull.pyx", line 2220, in scipy.spatial.qhull.ConvexHull.__init__ (scipy\spatial\qhull.c:19058)
File "C:\Users\****\AppData\Local\Continuum\Anaconda\lib\site- packages\numpy\core\numeric.py", line 550, in ascontiguousarray
return array(a, dtype, copy=False, order='C', ndmin=1)
ValueError: setting an array element with a sequence.
Any thoughts on this?
What you're doing looks overly complicated; you can pass the DataFrame columns directly to ConvexHull:
In [311]:
from scipy.spatial import ConvexHull
df = pd.DataFrame({'col_1':np.random.randn(30), 'col_2':np.random.randn(30), 'col3':0})
df
Out[311]:
col3 col_1 col_2
0 0 0.837349 1.526832
1 0 -0.282778 -0.150751
2 0 -0.331192 -0.382630
3 0 -0.933054 -0.234423
4 0 1.074336 -1.180293
5 0 0.296417 0.626924
6 0 0.806266 -0.501335
7 0 -1.192482 -1.793160
8 0 0.920646 1.377393
9 0 -1.255671 0.428256
10 0 -1.518031 0.888582
11 0 1.231974 0.566314
12 0 -0.717847 -0.236354
13 0 0.758947 -0.286670
14 0 -1.546001 1.774912
15 0 -0.707825 -0.529058
16 0 0.446111 0.406430
17 0 0.711017 0.774281
18 0 -2.616337 0.293725
19 0 -0.370344 -0.471336
20 0 -0.281950 -0.243941
21 0 -1.088772 -1.471154
22 0 -0.422274 -0.266592
23 0 0.423735 -0.341429
24 0 1.166969 -0.329791
25 0 0.689842 1.143460
26 0 0.462430 -0.843409
27 0 3.071030 1.615058
28 0 -0.812258 0.272436
29 0 0.707237 -1.717054
Then I can pass the columns directly:
hull = ConvexHull(df[['col_1','col_2']])
import matplotlib.pyplot as plt
plt.plot(df['col_1'], df['col_2'], 'o')
for simplex in hull.simplices:
    plt.plot(df['col_1'].iloc[simplex], df['col_2'].iloc[simplex], 'k-')
Which produces this plot:
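As an aside, the ValueError in the question comes from df.xy.values being a 1-D object array of Python lists, which ConvexHull cannot coerce into an (n, 2) float array. If you do want to go through an 'xy' column, here is a minimal sketch of the conversion, reusing the question's own names:
import numpy as np
from scipy.spatial import ConvexHull
# stack the object array of [x, y] lists into a proper (n, 2) float array
points = np.stack(df['xy'].values)
hull = ConvexHull(points)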
I am trying to use a Canadian gridded historical dataset of temperature anomalies, but it seems I don't have the skills to pull it off. The .grd files are temperature anomalies on what I believe is a highly regular grid. I have no experience with that kind of grid, and I am having trouble building the xarray dataset.
What I have (a subset of the .grd files and the text file is accessible here):
2075 '.grd' files ('t190001.grd' to 't202112.grd' following "t{year}{month}.grd" structure)
1 txt file listing the grid coordinates called "CANGRD_points_LL.txt"
From this I would like to build an xarray dataset in order to do some analysis.
Naively, I thought the grid files were already georeferenced, so I started by doing this:
import glob
import rioxarray as rio
import pandas as pd
import numpy as np
import xarray as xr
#not used for the moment even though I believe that will be needed
#df = pd.read_csv(r"CANGRD_points_LL.txt", sep = ' ', header=None)
list_files = sorted(set(glob.glob(r"t?????[0-2].grd" ) + glob.glob(r"t????0[0-9].grd" )))
times = pd.date_range("1900/01/01",freq='M', periods= len(list_files))
datarrays = [rio.open_rasterio(rst, masked=True,band_as_variable=True).assign_coords(time = t).expand_dims(dim='time').squeeze() for rst,t in zip(list_files, times)]
ds = xr.concat(datarrays,dim='time').rename({'band_1' : 'tas', 'y': 'lat', 'x' : 'lon'})
But as I plotted the results, it became evident that my coordinates were only the pixel indices:
So I believe I have to use the provided txt file. However, I have no idea how to build the xarray grid from the grid's coordinates, nor how to make it match the array obtained by loading a grid via rioxarray. Here is a sample; the complete file is available above. What baffles me is that most of the 11874 lines of the dataframe resulting from the txt file seem to be unique, so how could I fit an array of 125 lon by 95 lat into it?
0 1 2 3
0 0 0 40.0451 -129.8530
1 0 1 40.1780 -129.3650
2 0 2 40.3080 -128.8740
3 0 3 40.4348 -128.3801
4 0 4 40.5585 -127.8834
5 0 5 40.6790 -127.3840
6 0 6 40.7963 -126.8817
7 0 7 40.9104 -126.3768
8 0 8 41.0211 -125.8693
9 0 9 41.1286 -125.3591
10 0 10 41.2327 -124.8465
11 0 11 41.3335 -124.3314
12 0 12 41.4308 -123.8140
13 0 13 41.5247 -123.2942
14 0 14 41.6151 -122.7722
15 0 15 41.7020 -122.2481
16 0 16 41.7853 -121.7218
17 0 17 41.8651 -121.1936
18 0 18 41.9413 -120.6634
19 0 19 42.0139 -120.1313
20 0 20 42.0828 -119.5975
21 0 21 42.1481 -119.0620
22 0 22 42.2097 -118.5249
23 0 23 42.2675 -117.9863
24 0 24 42.3216 -117.4462
25 0 25 42.3720 -116.9049
26 0 26 42.4186 -116.3622
27 0 27 42.4614 -115.8185
28 0 28 42.5005 -115.2736
29 0 29 42.5357 -114.7279
30 0 30 42.5670 -114.1812
31 0 31 42.5946 -113.6338
32 0 32 42.6182 -113.0857
33 0 33 42.6381 -112.5371
34 0 34 42.6540 -111.9880
35 0 35 42.6661 -111.4385
36 0 36 42.6743 -110.8888
37 0 37 42.6786 -110.3389
38 0 38 42.6791 -109.7889
39 0 39 42.6757 -109.2390
40 0 40 42.6684 -108.6892
41 0 41 42.6572 -108.1397
42 0 42 42.6421 -107.5905
43 0 43 42.6232 -107.0417
44 0 44 42.6004 -106.4935
45 0 45 42.5738 -105.9459
46 0 46 42.5433 -105.3991
47 0 47 42.5090 -104.8531
48 0 48 42.4708 -104.3081
49 0 49 42.4289 -103.7640
Here is the view of one grid file loaded as xarray.
Any help would be greatly appreciated! Thank you so much.
I asked directly on the xarray GitHub discussions; here is the original answer from Keewis:
https://github.com/pydata/xarray/discussions/7443#discussioncomment-4700261
The grid file contains stacked 2D coordinates, which I guess is due to the grid's original coordinate system not being aligned with the lat / lon axes.
To read the coordinates into 2D coordinates you can use:
df = pd.read_csv(r"CANGRD_points_LL.txt", sep=" ", header=None, names=["y", "x", "lat", "lon"])
grid = df.set_index(["y", "x"]).to_xarray().set_coords(["lat", "lon"])
raw = xr.concat([...], dim="time")
ds = xr.merge([raw, grid]).assign_coords(time=times).rename_vars(...)
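Putting the two snippets together, a minimal sketch (it assumes the datarrays and times variables built in the question, and that the rasters' y/x coordinates can be replaced by plain integer indices so they align with the grid file's y/x index):
import numpy as np
import pandas as pd
import xarray as xr
# read the stacked 2D coordinates and pivot them into a (y, x) grid
df = pd.read_csv("CANGRD_points_LL.txt", sep=" ", header=None, names=["y", "x", "lat", "lon"])
grid = df.set_index(["y", "x"]).to_xarray().set_coords(["lat", "lon"])
# concatenate the monthly rasters and align their pixel indices with the grid
raw = xr.concat(datarrays, dim="time")
raw = raw.assign_coords(y=np.arange(raw.sizes["y"]), x=np.arange(raw.sizes["x"]))
# merge, so lat/lon become 2D coordinates alongside the renamed data variable
ds = xr.merge([raw, grid]).assign_coords(time=times).rename_vars({"band_1": "tas"})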
I know there are tons of similar question titles, but none of them solved my particular problem.
So I have this code:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
# my_list contains 983 list items
df = pd.DataFrame(np.array(my_list), columns=list('ABCDEF'))
df contains 983 rows built from a list of lists:
df.head()
A B C D E F
0 47 5 17 16 57 58
1 6 23 34 21 46 37
2 57 5 53 42 18 55
3 43 24 36 16 39 22
4 32 53 5 18 34 29
scaler = StandardScaler().fit(df.values)
transformed_dataset = scaler.transform(df.values)
transformed_df = pd.DataFrame(data=transformed_dataset, index=df.index)
number_of_rows = df.values.shape[0] # all our lists
window_length = 983 # amount of past number list we need to take in consideration for prediction
number_of_features = df.values.shape[1] # number count
train = np.empty([number_of_rows-window_length, window_length, number_of_features], dtype=float)
label = np.empty([number_of_rows-window_length, number_of_features], dtype=float)
window_length = 982
for i in range(0, number_of_rows-window_length):
    train[i] = transformed_df.iloc[i:i+window_length, 0:number_of_features]
    label[i] = transformed_df.iloc[i:i+window_length:i+window_length+1, 0:number_of_features]
train.shape
(0, 983, 6)
label.shape
(0, 6)
train[0] works fine, but when I do train[1] I get this error:
train[1]
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-43-e73aed9430c6> in <module>
----> 1 train[1]
IndexError: index 1 is out of bounds for axis 0 with size 0
Also, label[0] is fine, but when I do label[1] I get this error:
label[1]
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-45-1e13a70afa10> in <module>
----> 1 label[1]
IndexError: index 1 is out of bounds for axis 0 with size 0
How do I fix these IndexErrors?
You're creating arrays whose first dimension has size 0; that's why you're getting these errors.
You're using number_of_rows - window_length for the first dimension, and at the moment the arrays are allocated window_length is still 983, so that expression is 0. Reassigning window_length = 982 afterwards doesn't resize arrays that already exist. I guess that's not what you want.
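A minimal sketch of the fix, assuming the intent is windows of 982 past rows each predicting the single row that follows (it also replaces the odd stepped slice in the label line, which looks like a typo):
window_length = 982
# allocate only after window_length has its final value
train = np.empty([number_of_rows - window_length, window_length, number_of_features], dtype=float)
label = np.empty([number_of_rows - window_length, number_of_features], dtype=float)
for i in range(number_of_rows - window_length):
    # the window of past rows
    train[i] = transformed_df.iloc[i:i + window_length, 0:number_of_features]
    # the single row immediately after the window
    label[i] = transformed_df.iloc[i + window_length, 0:number_of_features]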
Pandas has a very handy function, DataFrame.corr(), for computing the pairwise correlation of columns.
That means it is possible to compare correlations between any number of columns. For instance:
df = pd.DataFrame(np.random.randint(0,100,size=(100, 10)))
0 1 2 3 4 5 6 7 8 9
0 9 17 55 32 7 97 61 47 48 46
1 8 83 87 56 17 96 81 8 87 0
2 60 29 8 68 56 63 81 5 24 52
3 42 76 6 75 7 59 19 17 3 63
...
Now it is possible to test correlation between all 10 columns with df.corr(method='pearson'):
0 1 2 3 4 5 6 7 8 9
0 1.000000 0.082789 -0.094096 -0.086091 0.163091 0.013210 0.167204 -0.002514 0.097481 0.091020
1 0.082789 1.000000 0.027158 -0.080073 0.056364 -0.050978 -0.018428 -0.014099 -0.135125 -0.043797
2 -0.094096 0.027158 1.000000 -0.102975 0.101597 -0.036270 0.202929 0.085181 0.093723 -0.055824
3 -0.086091 -0.080073 -0.102975 1.000000 -0.149465 0.033130 -0.020929 0.183301 -0.003853 -0.062889
4 0.163091 0.056364 0.101597 -0.149465 1.000000 -0.007567 -0.017212 -0.086300 0.177247 -0.008612
5 0.013210 -0.050978 -0.036270 0.033130 -0.007567 1.000000 -0.080148 -0.080915 -0.004612 0.243713
6 0.167204 -0.018428 0.202929 -0.020929 -0.017212 -0.080148 1.000000 0.135348 0.070330 0.008170
7 -0.002514 -0.014099 0.085181 0.183301 -0.086300 -0.080915 0.135348 1.000000 -0.114413 -0.111642
8 0.097481 -0.135125 0.093723 -0.003853 0.177247 -0.004612 0.070330 -0.114413 1.000000 -0.153564
9 0.091020 -0.043797 -0.055824 -0.062889 -0.008612 0.243713 0.008170 -0.111642 -0.153564 1.000000
Is there a simple way to also get the corresponding p-values (ideally in pandas), as it is returned e.g. by scipy's kendalltau()?
Why not use the method argument of pandas.DataFrame.corr():
pearson : standard correlation coefficient.
kendall : Kendall Tau correlation coefficient.
spearman : Spearman rank correlation.
callable: callable with input two 1d ndarrays and returning a float.
from scipy.stats import kendalltau, pearsonr, spearmanr

def kendall_pval(x, y):
    return kendalltau(x, y)[1]

def pearsonr_pval(x, y):
    return pearsonr(x, y)[1]

def spearmanr_pval(x, y):
    return spearmanr(x, y)[1]
and then
corr = df.corr(method=pearsonr_pval)
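Note that this returns a matrix of p-values in place of the coefficients, so call corr() twice if you need both. Also, pandas fills the diagonal with 1.0 whenever a callable is passed, so the diagonal of the p-value matrix is not a real p-value. For example:
corr = df.corr(method='pearson')
pvals = df.corr(method=pearsonr_pval)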
Probably just loop. It's basically what pandas does in the source code to generate the correlation matrix anyway:
import pandas as pd
import numpy as np
from scipy import stats
df_corr = pd.DataFrame() # Correlation matrix
df_p = pd.DataFrame() # Matrix of p-values
for x in df.columns:
    for y in df.columns:
        corr = stats.pearsonr(df[x], df[y])
        df_corr.loc[x, y] = corr[0]
        df_p.loc[x, y] = corr[1]
If you want to leverage the fact that this is symmetric, so you only need to calculate this for roughly half of them, then do:
mat = df.values.T
K = len(df.columns)
correl = np.empty((K,K), dtype=float)
p_vals = np.empty((K,K), dtype=float)
for i, ac in enumerate(mat):
    for j, bc in enumerate(mat):
        if i > j:
            continue
        corr = stats.pearsonr(ac, bc)
        #corr = stats.kendalltau(ac, bc)
        correl[i, j] = corr[0]
        correl[j, i] = corr[0]
        p_vals[i, j] = corr[1]
        p_vals[j, i] = corr[1]
df_p = pd.DataFrame(p_vals)
df_corr = pd.DataFrame(correl)
#pd.concat([df_corr, df_p], keys=['corr', 'p_val'])
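A small follow-up, not in the original answer: DataFrames built from bare NumPy arrays lose the column labels, so you may want to pass them back in:
df_p = pd.DataFrame(p_vals, index=df.columns, columns=df.columns)
df_corr = pd.DataFrame(correl, index=df.columns, columns=df.columns)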
This will work:
from scipy.stats import pearsonr
column_values = df.columns.tolist()
df['Correlation_coefficent'], df['P-value'] = zip(*df.T.apply(lambda x: pearsonr(x[column_values], x[column_values])))
df_result = df[['Correlation_coefficent','P-value']]
Does this work for you?
from scipy.stats import pearsonr
import numpy as np
#call the correlation function; you could round the values if needed
df_c = df.corr().round(1)
#get the p values; corr() with a callable fills the diagonal with 1s, so subtract those off
pval = df.corr(method=lambda x, y: pearsonr(x, y)[1]) - np.eye(*df_c.shape)
#flag the p values: *** for less than 0.001, ** for less than 0.01, * for less than 0.05
p = pval.applymap(lambda x: ''.join(['*' for t in [0.001, 0.01, 0.05] if x <= t]))
#df_c2 below gives the dataframe with correlation coefficients and p values
df_c2 = df_c.astype(str) + p
#you could also plot the correlation matrix using sns.heatmap if you want
import seaborn as sns
import matplotlib.pyplot as plt
#mask the upper triangle of the plot
matrix = np.triu(df_c)
#convert to array for the heatmap annotations
df_c3 = df_c2.to_numpy()
#plot the heatmap
plt.figure(figsize=(13, 8))
sns.heatmap(df_c, annot=df_c3, fmt='', vmin=-1, vmax=1, center=0, cmap='coolwarm', mask=matrix)
I have a dataframe of which I want to make a scatter plot.
The dataframe looks like:
year length Animation
0 1971 121 1
1 1939 71 1
2 1941 7 0
3 1996 70 1
4 1975 71 0
I want the points in my scatter plot to be a different color depending on the value in the Animation column:
Animation = 1 = yellow, Animation = 0 = black, or something similar.
I tried doing the following:
dfScat = df[['year','length', 'Animation']]
dfScat = dfScat.loc[dfScat.length < 200]
axScat = dfScat.plot(kind='scatter', x=0, y=1, alpha=1/15, c=2)
This results in a slider which makes it hard to tell the difference.
You can also assign discrete colors to the points by passing an array to c=, like this:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
d = {"year" : (1971, 1939, 1941, 1996, 1975),
"length" : ( 121, 71, 7, 70, 71),
"Animation" : ( 1, 1, 0, 1, 0)}
df = pd.DataFrame(d)
print(df)
colors = np.where(df["Animation"]==1,'y','k')
df.plot.scatter(x="year",y="length",c=colors)
plt.show()
This gives:
Animation length year
0 1 121 1971
1 1 71 1939
2 0 7 1941
3 1 70 1996
4 0 71 1975
Use the c parameter in scatter:
df.plot.scatter('year', 'length', c='Animation', colormap='jet')
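If you want exactly the yellow/black mapping from the question, one option (a sketch, not part of the original answer) is a two-color ListedColormap instead of 'jet'; turning the colorbar off also avoids the continuous scale that made the first attempt hard to read:
from matplotlib.colors import ListedColormap
# 0 maps to black, 1 maps to yellow
df.plot.scatter('year', 'length', c='Animation', colormap=ListedColormap(['black', 'yellow']), colorbar=False)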
I would like to plot a histogram with the value TP on the y axis and the method on the x axis. In particular, I would like to obtain a different figure for each value of the column 'data'.
In this case I want a first histogram with values 2, 1, 6, 9, 8, 1, 0 and a second histogram with values 10, 10, 16, ...
The Python version of ggplot seems to be slightly different from the R one.
FN FP TN TP data method
method
SS0208 18 0 80 2 A p=100 n=100 SNR=0.5 SS0208
SS0408 19 0 80 1 A p=100 n=100 SNR=0.5 SS0408
SS0206 14 9 71 6 A p=100 n=100 SNR=0.5 SS0206
SS0406 11 6 74 9 A p=100 n=100 SNR=0.5 SS0406
SS0506 12 6 74 8 A p=100 n=100 SNR=0.5 SS0506
SS0508 19 0 80 1 A p=100 n=100 SNR=0.5 SS0508
LKSC 20 0 80 0 A p=100 n=100 SNR=0.5 LKSC
SS0208 10 1 79 10 A p=100 n=100 SNR=10 SS0208
SS0408 10 0 80 10 A p=100 n=100 SNR=10 SS0408
SS0206 4 5 75 16 A p=100 n=100 SNR=10 SS0206
As a first step I tried to plot only one histogram, and I received an error.
df = df[df.data == df.data.unique()[0]]
In [65]: ggplot() + geom_bar(df, aes(x='method', y='TP'), stat='identity')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-65-dd47b8d85375> in <module>()
----> 1 ggplot() + geom_bar(df, aes(x='method', y='TP'), stat='identity')
TypeError: __init__() missing 2 required positional arguments: 'aesthetics' and 'data'
I have tried different combinations of commands, but I did not solve it.
Once this first problem is solved, I would like the histograms grouped according to the value of 'data'; this could probably be done with facet_wrap.
This is probably because you called ggplot() without an argument (not sure if that should be possible; if you think it should, please add an issue on http://github.com/yhat/ggplot).
Anyway, this should work:
ggplot(df, aes(x='method', y='TP')) + geom_bar(stat='identity')
Unfortunately, faceting with geom_bar doesn't work properly yet (it only works when all facets have all levels / x values!) -> Bugreport
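For reference, a sketch of what the faceted version would look like once that bug is fixed (assuming yhat-ggplot's facet_wrap accepts the column name, as in R):
p = ggplot(df, aes(x='method', y='TP')) + geom_bar(stat='identity') + facet_wrap('data')
print(p)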