Closest points based on coordinates, Python

I have a list of stations with x and y coordinates. I am trying to find at least the 4 closest points for each station. I had a look at this link but was not able to figure out how to do that.
For example, my data looks like:
station Y X
601 28.47 83.43
604 28.45 83.42
605 28.16 83.36
606 28.29 83.39
607 28.38 83.36
608 28.49 83.53
609 28.21 83.34
610 29.03 83.53
612 29.11 83.58
613 28.11 83.45
614 28.13 83.42
615 282.4 83.06
616 28.36 83.13
619 28.24 83.44
620 28.02 83.39
621 28.23 83.24
622 28.09 83.34
623 29.06 84
624 28.58 83.47
625 28.54 83.41
626 28.28 83.36
627 28.23 83.29
628 28.3 83.18
629 28.34 83.23
630 28.08 83.37
633 29.11 83.59
Any help will be highly appreciated.

For large data, it pays to be clever about data structures. As you already tagged, there are specialized data structures for this kind of lookup. SciPy supports some; sklearn is even more complete (and, in my opinion, better and more actively developed for these tasks)!
The code example uses SciPy's API to avoid (Python-)loops. The disadvantage is that, for each element, the zero distance to itself has to be discarded.
Code
import numpy as np
from scipy.spatial import KDTree
""" Data """
data_i = np.array([601, 604, 605, 606])  # station ids
data = np.array([[28.47, 83.43], [28.45, 83.42], [28.16, 83.36], [82.29, 83.39]])  # (y, x) coordinates
print(data_i)
print(data)
""" KDTree """
N_NEIGHBORS = 2
kdtree = KDTree(data)
kdtree_q = kdtree.query(data, k=N_NEIGHBORS + 1)  # +1 because the 0-dist to self is included
print(data_i[kdtree_q[1][:, 1:]])                 # discard column 0 (the 0-dist to self)
# uses the guarantee that results are sorted by distance
Output
[601 604 605 606]
[[ 28.47 83.43]
[ 28.45 83.42]
[ 28.16 83.36]
[ 82.29 83.39]]
[[604 605]
[601 605]
[604 601]
[601 604]]
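Since the answer mentions sklearn as an alternative, here is a rough sketch (not part of the original answer) of the same lookup with sklearn.neighbors.NearestNeighbors, using the toy data from above:
import numpy as np
from sklearn.neighbors import NearestNeighbors
data_i = np.array([601, 604, 605, 606])
data = np.array([[28.47, 83.43], [28.45, 83.42], [28.16, 83.36], [82.29, 83.39]])
N_NEIGHBORS = 2
# ask for one extra neighbor, because each point's nearest neighbor is itself
nn = NearestNeighbors(n_neighbors=N_NEIGHBORS + 1).fit(data)
dist, idx = nn.kneighbors(data)
# drop column 0 (the zero distance to itself); results are sorted by distance
print(data_i[idx[:, 1:]])
This should print the same neighbor table as the KDTree version above.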

Related

Which ML algorithm would be appropriate for clustering a DataFrame with a combination of categorical and numerical columns?

I wish to cluster a DataFrame with dimensions (120000 x 4).
It consists of two string-based "label" columns (Str1 and Str2) and two numerical columns, which look like the following:
Str1 Str2 Energy intensity
0 713 599 7678.159 5367.276014
1 715 598 7678.182 6576.100453
2 714 597 7678.183 5675.788001
3 684 587 7678.493 3040.650157
4 693 588 7678.585 5585.908164
5 695 586 7678.615 3184.001905
6 684 584 7678.674 4896.774505
7 799 509 7693.645 4907.484401
8 798 508 7693.754 4075.800912
9 797 507 7693.781 4407.800702
10 796 506 7694.043 3138.073328
11 794 505 7694.049 3653.699936
12 795 504 7694.077 3875.120022
13 675 277 7694.948 3081.797654
14 709 221 7698.216 3587.704908
15 708 220 7698.252 4070.050144
...........
What would be the best ML algorithm to cluster/categorize this data?
I have tried plotting the individual energy and intensity components belonging to one specific category (Str1 == "713", etc.), which didn't give me much information. I am in need of somewhat more compact clustering, if possible.
You can try applying categorical (ordinal) encoding or one-hot encoding to Str1 and Str2 (categorical encoding is suitable when the classes have a magnitude relation, while one-hot encoding is more widely used). These will convert the strings into numerical data, and then you can simply use any clustering model.
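As an illustration only (the DataFrame name df and the choice of KMeans with 5 clusters are assumptions, not from the question), the encoding step could look like this:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
# df is assumed to hold the columns Str1, Str2, Energy, intensity
encoded = pd.get_dummies(df, columns=['Str1', 'Str2'])   # one-hot encode the string columns
X = StandardScaler().fit_transform(encoded)              # put all features on a comparable scale
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)  # any clustering model works here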

Using principal components analysis to understand which data we can remove

I would like to use the principal components analysis (PCA) method in Python to understand which data are most important to my machine learning model, so I can get rid of the data that have less influence on my prediction.
To do this, I started with a simple example, and I will apply it to my real data later. The following example consists of 5 columns (i.e., five features or variables) and 100 rows (i.e., 100 samples).
My dataset is:
wt1 wt2 wt3 wt4 wt5 ko1 ko2 ko3 ko4 ko5
gene1 485 474 475 478 471 149 132 136 146 165
gene2 134 129 170 133 129 53 46 45 44 43
gene3 850 894 925 832 815 485 545 503 475 568
gene4 709 728 706 728 722 106 119 138 144 147
gene5 593 548 546 606 587 648 627 584 641 607
... ... ... ... ... ... ... ... ... ...
gene96 454 404 413 462 420 293 312 327 297 332
gene97 746 691 799 716 762 557 527 511 560 517
gene98 736 782 744 821 737 856 860 840 866 853
gene99 565 513 568 529 565 218 255 224 217 223
gene100 494 457 482 435 468 586 598 562 573 550
The features are wt1 to ko5, so I would like the PCA to tell me which wt or ko columns I can remove without influencing the accuracy of my model.
Here is my code:
import pandas as pd
import numpy as np
import random as rd
from sklearn.decomposition import PCA
from sklearn import preprocessing
import matplotlib.pyplot as plt
genes = ['gene' + str(i) for i in range(1,101)]
wt = ['wt' + str(i) for i in range(1,6)]
ko = ['ko' + str(i) for i in range(1,6)]
data = pd.DataFrame(columns=[*wt, *ko], index=genes)
# for each gene in the index(i.e. gene1, gene2,.. gene100), we create 5 values for the "wt" samples and 5 values for the "ko"..
# The mean can vary between 10 and 1000
for gene in data.index:
    data.loc[gene, 'wt1':'wt5'] = np.random.poisson(lam=rd.randrange(10, 1000), size=5)  # size=5 because we have wt1, wt2, ... wt5
    data.loc[gene, 'ko1':'ko5'] = np.random.poisson(lam=rd.randrange(10, 1000), size=5)  # size=5 because we have ko1, ko2, ... ko5
#print(data.head()) # only the first five rows
#print(data)
## Before we do PCA, we have to center and scale the data..
## After centering, the average value for each gene will be 0,
## After scaling, the standard deviation for the values for each gene will be 1
## Notice that we are passing in the transpose of our data, the scale function expects the samples to be rows instead of columns
scaled_data = preprocessing.scale(data.T)  ## or StandardScaler().fit_transform(data.T)
# Variation is calculated in sklearn as: [(measurements - mean)**2 / the number of measurements]
# Variation is calculated in R as: [(measurements - mean)**2 / (the number of measurements - 1)]
# In practice the difference between the two is negligible..
pca = PCA() ## PCA here is an object
## Now we call the fit method on the scaled data
pca.fit(scaled_data) ## This is where we do all of the PCA math (i.e. calculate loading scores and the variation each principal component accounts for..)
pca_data = pca.transform(scaled_data) ## this is where we generate coordinates for a PCA graph based on the loading scores and the scaled data..
## We'll start with a scree plot to see how many principal components should go into the final plot..
# The first thing we do is calculate the percentage of variation that each principal component accounts for..
per_var = np.round(pca.explained_variance_ratio_* 100, decimals=1)
labels = ['PC' + str(x) for x in range(1, len(per_var)+1)]
plt.bar(x=range(1,len(per_var)+1), height = per_var, tick_label =labels)
plt.ylabel('Percentage of Explained Variance')
plt.xlabel('Principal Component')
plt.title('Scree Plot')
plt.show()
## Almost all the variation is along the first PC, so a 2-D graph, using PC1 and PC2, should do a good job representing the original data.
pca_df = pd.DataFrame(pca_data, index=[*wt, *ko], columns=labels) ## This is to organize the new data created by pca.transform(scaled_data) into a matrix
plt.scatter(pca_df.PC1, pca_df.PC2)
plt.title('My PCA Graph')
plt.xlabel('PC1 - {0}%'.format(per_var[0]))
plt.ylabel('PC2 - {0}%'.format(per_var[1]))
for sample in pca_df.index:
    plt.annotate(sample, (pca_df.PC1.loc[sample], pca_df.PC2.loc[sample]))
plt.show()
loading_scores = pd.Series(pca.components_[0], index=genes) # We'll start by creating a pandas "Series" object with the loading scores in PC1
sorted_loading_scores = loading_scores.abs().sort_values(ascending = False) #Sorting the loading scores based on their magnitude (absolute value)
top_10_genes = sorted_loading_scores[0:10].index.values ## Here we are just getting the names of the top 10 indexes (which are the gene names)
print(loading_scores[top_10_genes]) ## Printing out the top 10 gene names and their corresponding loading scores
The code produces two figures: the scree plot and the 2-D PCA scatter plot (not shown here).
As we can see, PC1 accounts for 89.5% of the variation and PC2 accounts for 2.8%, so I can represent the original data using only PC1 and PC2.
My question is:
Is there a way to correlate PC1 and PC2 with the original data so I can understand which features in the original data are the least important?
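For what it's worth, here is a sketch extending the asker's own loading_scores snippet to both PC1 and PC2 (only an illustration, reusing the fitted pca object and the genes list from the code above):
import pandas as pd
# one row per original feature (gene), one column per principal component
loadings = pd.DataFrame(pca.components_[:2].T, index=genes, columns=['PC1', 'PC2'])
# features with small absolute loadings on both PCs contribute little to the
# directions that carry most of the variance
importance = loadings.abs().max(axis=1).sort_values()
print(importance.head(10))   # the 10 least influential features in PC1/PC2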

Cannot open eps file after saving figure

Normally, opening an eps file is no problem, but with the current Python code I am working on, the exported eps file keeps loading when opened but never appears. I have tried exporting the same figure as a png, and that works fine. I have also tried exporting a really simple figure as eps, and that opens without any flaws. I have included some of the relevant code concerning the plot/figure below. Any help would be much appreciated.
#%% plot section
plt.close('all')
plt.figure()
plt.errorbar(r,omega,yerr=omega_err,fmt='mo')
plt.xlabel('xlabel')
plt.ylabel('ylabel')
plt.title('profile averaged from {} ms to {} ms \n shot {}'.format(tidsinterval[0],tidsinterval[1],skud_numre[0]),y=1.05)
plt.grid()
plt.axhline(y=2.45,color='Red')
plt.text(39,2.43,'txt block for horizontal line',backgroundcolor='white')
plt.axvline(x=37.5,color='Purple')
plt.text(37.5,1.2,'txt block for vertical line',ha='center',va="center",rotation='vertical',backgroundcolor='white')
plt.savefig('directory/plot.eps', format='eps')
plt.show()
The variables r, omega, and omega_err are small vectors of floats (around 6 elements each).
Update: The program I use for opening eps files is Evince. Furthermore, the eps file can be downloaded here: https://filedropper.com/d/s/z7lxUCtANeox7tDMQ6dI6HZUpcTfHn. As far as I can see, sharing files over filedropper is within the community guidelines, but if I'm wrong please say so.
I found out that it is possible to open the file as long as there is no text in the plot (for example the x-label, y-label, title, and so on), so the problem has to be related to the text.
The short answer is: it's your font. The /e glyph is throwing an error on setcachedevice (your PostScript interpreter should have told you this).
The actual problem is that the font program is careless (at least) about its use of function names. The program contains this:
/mpldict 11 dict def
mpldict begin
/d { bind def } bind def
That creates a new dictionary called mpldict, begins that dictionary (makes it the topmost entry on the dictionary stack), and defines a function called 'd' in that dictionary.
We then move on to the font definition. There's a lot of boilerplate in here, but each character shape is defined by an entry in the font's CharStrings dictionary. We'll pick that up with the definition of the function called 'd' in the font's CharStrings dictionary:
/d{1300 0 113 -29 1114 1556 sc
930 950 m
930 1556 l
ce} d
Notice that what this does is create a new definition of a function named 'd' in the current dictionary. That's not a problem in itself. We now have two functions named 'd'; one in the current dictionary (the font's CharStrings dictionary) and one in 'mpldict'.
Then we define the next character:
/e{1260 0 113 -29 1151 1147 sc
1151 606 m
1151 516 l
305 516 l
313 389 351 293 419 226 c
488 160 583 127 705 127 c
776 127 844 136 910 153 c
977 170 1043 196 1108 231 c
1108 57 l
1042 29 974 8 905 -7 c
836 -22 765 -29 694 -29 c
515 -29 374 23 269 127 c
165 231 113 372 113 549 c
113 732 162 878 261 985 c
360 1093 494 1147 662 1147 c
813 1147 932 1098 1019 1001 c
1107 904 1151 773 1151 606 c
967 660 m
966 761 937 841 882 901 c
827 961 755 991 664 991 c
561 991 479 962 417 904 c
356 846 320 764 311 659 c
967 660 l
ce} d
Now, the last thing we do at the end of defining that character shape (for the character named 'e') is call a function named 'd'. But there are two, so which one do we call? The answer is that we work backwards down the dictionary stack, looking in each dictionary to see if it has a function called 'd', and we use the first one we find. The current dictionary is the font's CharStrings dictionary, and it has a function called 'd' (which defines the 'd' character), so we call that.
And that function then tries to use setcachedevice. That operator is not legal except when executing a character description, which we are not doing, so it throws an undefined error.
Now your PostScript interpreter should tell you there is an error (Ghostscript, for example, does so). Because there is an error the interpreter stops and doesn't draw anything further, which is why you get a blank page.
What can you do about this? Well, you could raise a bug report with the creating application (apparently Matplotlib created the font too). This is not a good way to define a font!
Other than that, frankly, the only thing you can do is search and replace through the file. If you look for occurrences of ce} d and replace them with ce}bind def, it'll probably work. This time.
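A minimal sketch of that workaround in Python (the file path is just an example):
# rewrite the EPS file, replacing the problematic 'ce} d' with 'ce}bind def'
path = 'directory/plot.eps'
with open(path, 'r') as f:
    eps = f.read()
eps = eps.replace('ce} d', 'ce}bind def')
with open(path, 'w') as f:
    f.write(eps)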

sklearn StandardScaler doesn't seem to be working properly

I am trying to normalise my data so that it will be normally distributed, which I need for a later hypothesis test. The data I am trying to normalise, points, is as follows:
P100m Plj Psp Phj P400m P110h Ppv Pdt Pjt P1500
0 938 1061 773 859 896 911 880 732 757 752
1 839 975 870 749 887 878 880 823 863 741
2 814 866 841 887 921 939 819 778 884 691
3 872 898 789 878 848 879 790 790 861 804
4 892 913 742 803 816 869 1004 789 854 699
... ... ... ... ... ... ... ... ... ...
7963 755 760 604 714 812 794 482 571 539 780
7964 830 845 524 767 786 783 601 573 562 535
7965 819 804 653 840 791 699 659 461 448 632
7966 804 720 539 758 830 782 731 487 425 729
7967 687 809 692 714 565 741 804 527 738 523
I am using sklearn.preprocessing.StandardScaler() and my code is as follows:
scaler = preprocessing.StandardScaler()
scaler.fit(points)
points_norm = scaler.transform(points)
points_norm_df = pd.DataFrame(points_norm, columns=['P100m', 'Plj', 'Psp', 'Phj', 'P400m',
                                                    'P110h', 'Ppv', 'Pdt', 'Pjt', 'P1500'])
The strange part is that I am running an Anderson-Darling normality test from scipy.stats.anderson and the result is that it is very far from a normal distribution.
I am not the most proficient statistician. Am I misunderstanding what I am doing here or is it a problem with my code/data?
Any help would be greatly appreciated
StandardScaler does not claim to make the data normally distributed; rather, it standardizes the data so that it has zero mean and unit variance.
From the documentation:
Standardize features by removing the mean and scaling to unit variance
The standard score of a sample x is calculated as z = (x - u) / s
where u is the mean of the training samples or zero if
with_mean=False, and s is the standard deviation of the training
samples or one if with_std=False.
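For instance, a quick sketch with made-up numbers showing that StandardScaler reproduces exactly this z-score (and nothing more):
import numpy as np
from sklearn.preprocessing import StandardScaler
x = np.array([[1.0], [2.0], [3.0], [4.0]])      # illustrative data, one feature
z_manual = (x - x.mean()) / x.std()             # z = (x - u) / s
z_scaler = StandardScaler().fit_transform(x)
print(np.allclose(z_manual, z_scaler))          # True: identical standardization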
As gilad already pointed out, the StandardScaler is standardizing your data.
You can find a list of methods here for preprocessing: https://scikit-learn.org/stable/modules/preprocessing.html
Are you searching for:
6.3.2.1. Mapping to a Uniform distribution
QuantileTransformer and quantile_transform provide a non-parametric
transformation to map the data to a uniform distribution with values
between 0 and 1
This would work something like this:
quantile_transformer = preprocessing.QuantileTransformer(random_state=0)
points_norm = quantile_transformer.fit_transform(points)
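Since the goal in the question is a normal rather than a uniform distribution, note that the same transformer also accepts output_distribution='normal'; a sketch of that variant (not part of the original answer):
# map each feature towards a Gaussian shape instead of a uniform one
quantile_transformer = preprocessing.QuantileTransformer(output_distribution='normal', random_state=0)
points_norm = quantile_transformer.fit_transform(points)
Alternatively, preprocessing.PowerTransformer() is another option for pushing data towards a Gaussian shape.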

Selecting Column from pandas Series

I have a Series named 'graph' in pandas that looks like this:
Wavelength
450 37
455 31
460 0
465 0
470 0
475 0
480 418
485 1103
490 1236
495 894
500 530
505 85
510 0
515 168
520 0
525 0
530 691
535 842
540 5263
545 4738
550 6237
555 1712
560 767
565 620
570 0
575 757
580 1324
585 1792
590 659
595 1001
600 601
605 823
610 0
615 134
620 3512
625 266
630 155
635 743
640 648
645 0
650 583
Name: A1, dtype: object
I am graphing the curve using graph.plot().
The goal is to smooth the curve. I was trying to use savgol_filter, but to do that I need to separate my Series into x and y columns. As of right now, I can access the "Wavelength" column by using graph.index, but I can't grab the next column to assign it as y.
I've tried using iloc and loc and haven't had any luck yet.
Any tips or new directions to try?
You don't need to pass an x and a y to savgol_filter. You just need the y values, which get passed automatically when you pass graph to it. What you are missing are the window size and polynomial order parameters that define the smoothing.
from scipy.signal import savgol_filter
import pandas as pd
# I passed `graph` but I could've passed `graph.values`
# It is `graph.values` that will get used in the filtering
pd.Series(savgol_filter(graph, 7, 3), graph.index).plot()
To address some other points of misunderstanding
graph is a pandas.Series and NOT a pandas.DataFrame. A pandas.DataFrame can be thought of as a pandas.Series of pandas.Series.
So you access the index of the series with graph.index and the values with graph.values.
You could have also done
import matplotlib.pyplot as plt
plt.plot(graph.index, savgol_filter(graph.values, 7, 3))
As you are using a Series instead of a DataFrame, some libraries cannot access the index to use it as a column. Use:
df = df.reset_index()
It will convert the index into an extra column that you can use in savgol_filter or any other function.
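For example, a sketch of how reset_index and savgol_filter could be combined here, assuming the Series shown in the question (index named Wavelength, values named A1):
from scipy.signal import savgol_filter
import matplotlib.pyplot as plt
df = graph.reset_index()                                  # columns: 'Wavelength' and 'A1'
smoothed = savgol_filter(df['A1'].astype(float), 7, 3)    # cast from object dtype first
plt.plot(df['Wavelength'], smoothed)
plt.show()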
