Cannot save to grib2 file using python iris module - python

I'm using the Python iris module to read in some netCDF data and output specific fields in grib format for further downstream processing. However, I get the following error:
.../pythonlib/iris/1.9.1/lib/python2.7/site-packages/Iris-1.9.1-py2.7-linux-x86_64.egg/iris/fileformats/grib/_save_rules.pyc in gribbability_check(cube)
1062 cs1 = cube.coord(dimensions=[1]).coord_system
1063 if cs0 is None or cs1 is None:
-> 1064 raise iris.exceptions.TranslationError("CoordSystem not present")
1065 if cs0 != cs1:
1066 raise iris.exceptions.TranslationError("Inconsistent CoordSystems")
TranslationError: CoordSystem not present
So, after having read the following:
Iris Google group thread https://groups.google.com/forum/#!searchin/scitools-iris/grib2/scitools-iris/D2InfYESaUM/yVT7ayXSFV0J
StackOverflow thread Converting NetCDF to GRIB2
iris source code at https://github.com/SciTools/iris/blob/master/lib/iris/fileformats/grib/grib_save_rules.py#L80
I attempted the following:
In [26]: radius=iris.fileformats.pp.EARTH_RADIUS
In [27]: u.coord(dimensions=[0]).coord_system=iris.coord_systems.GeogCS(radius)
In [28]: u.coord(dimensions=[1]).coord_system=iris.coord_systems.GeogCS(radius)
In [29]: u.coord(dimensions=[0]).coord_system
Out[29]: GeogCS(6371229.0)
In [30]: u.coord(dimensions=[1]).coord_system
Out[30]: GeogCS(6371229.0)
In [31]: iris.save(u,'prod.grib2')
---------------------------------------------------------------------------
TranslationError Traceback (most recent call last)
<ipython-input-15-a38abe1720ac> in <module>()
----> 1 iris.save(u,'prod.grib2')
i.e. I still get the same error: a failure in the iris routine gribbability_check.
Hoping someone can help. I'm using iris 1.9.0 with Python 2.7.6. The operation also fails with iris 1.8.0.
Cheers

Thanks to Andrew Dawson on the iris Google group for the answer. Dimensions [0] and [1] in grib_save_rules.py strictly refer to spatial dimensions, even if your cube uses time for the zeroth dimension. To quote:
There is a huge amount of code in between your cube and saving as
grib2. Since grib knows nothing about dimensionalities above 2 (it
only stores 1 grid per message) we split your cube up into one slice
per grid and pass that onward, hence in the function you are referring
to dimension 0 is latitude and 1 is longitude regardless of how many
other dimensions your cube had.
If I repeat the process but assign the coord_system to my spatial (latitude/longitude) coordinates instead, and also give the vertical coordinate a standard name using
cube.coord('vertical_level').standard_name = 'air_pressure'
then the grib file can be saved.
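For reference, a minimal sketch of that fix, assuming the cube u from the question, spatial coordinates named 'latitude' and 'longitude', and a vertical coordinate named 'vertical_level':
import iris
import iris.coord_systems
import iris.fileformats.pp

radius = iris.fileformats.pp.EARTH_RADIUS
cs = iris.coord_systems.GeogCS(radius)

# Attach the coordinate system to the spatial coordinates (by name, not by cube dimension index)
u.coord('latitude').coord_system = cs
u.coord('longitude').coord_system = cs

# Give the vertical coordinate a standard name the grib saver can translate
u.coord('vertical_level').standard_name = 'air_pressure'

iris.save(u, 'prod.grib2')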

Related

How to make Open3D read a pandas DataFrame and generate point clouds in Python

I extracted certain data from the original CSV file (which contains the XYZ coordinates) by using the following code:
data=pd.read_csv("./assets/landmarks_frame0.csv",header=None,usecols=range(1,4))
print(data)
The printed output looks fine, as shown below. Recall that the first column (starting with 0.524606), the second, and the third correspond to the x, y, and z coordinates.
[snipped image of the pandas DataFrame extracted from the CSV file]
Meanwhile, my goal is to import the Open3D library and generate a point cloud from the data extracted with pandas. I read the Open3D documentation (http://www.open3d.org/docs/release/tutorial/geometry/pointcloud.html) and wrote the following script:
print("Load a ply point cloud, print it, and render it")
pcd = o3d.io.read_point_cloud(data,format="xyz")
print(pcd)
print(np.asarray(pcd.points))
o3d.visualization.draw_geometries([pcd],
zoom=0.3412,
front=[0.4257, -0.2125, -0.8795],
lookat=[2.6172, 2.0475, 1.532],
up=[-0.0694, -0.9768, 0.2024])
As shown in the second line
pcd = o3d.io.read_point_cloud(data,format="xyz")
I followed the File IO documentation (http://www.open3d.org/docs/release/tutorial/geometry/file_io.html) and passed the DataFrame as the first argument, i.e. the data to be processed into the point cloud. I also set the second argument, format, to 'xyz', which means each line contains [x, y, z], where x, y, and z are the 3D coordinates.
However, the error message is as follows.
TypeError Traceback (most recent call last)
Input In [3], in <cell line: 4>()
1 print("Load a ply point cloud, print it, and render it")
2 # ply_point_cloud = o3d.data.PLYPointCloud()
3 # pcd = o3d.io.read_point_cloud(data,format="xyz")
----> 4 pcd = o3d.io.read_point_cloud(data,format="xyz")
6 print(pcd)
7 print(np.asarray(pcd.points))
TypeError: read_point_cloud(): incompatible function arguments. The following argument types are supported:
1. (filename: str, format: str = 'auto', remove_nan_points: bool = False, remove_infinite_points: bool = False, print_progress: bool = False) -> open3d.cpu.pybind.geometry.PointCloud
Invoked with: 1 2 3
0 0.524606 0.675098 -0.021419
1 0.524134 0.628257 -0.034960
2 0.524757 0.641571 -0.019187
3 0.518863 0.589718 -0.024071
4 0.523975 0.615806 -0.036730
.. ... ... ...
473 0.557430 0.553579 0.006053
474 0.563593 0.553342 0.006053
475 0.557327 0.544035 0.006053
476 0.551414 0.553678 0.006053
477 0.557613 0.563182 0.006053
[478 rows x 3 columns]; kwargs: format='xyz'
I would like to know how I should correctly import the data into the Open3D and generate the point cloud. I appreciate your help.
Open3D supports NumPy arrays, so first you have to convert your DataFrame of XYZ coordinates to a NumPy array. You can then convert that NumPy array to an Open3D point cloud. You can check the documentation (here) of Open3D for further details.
The important lines (of documentation) for the conversion of a NumPy array to an Open3D point cloud are given below:
# Pass xyz to Open3D.o3d.geometry.PointCloud and visualize
pcd = o3d.geometry.PointCloud()
pcd.points = o3d.utility.Vector3dVector(xyz)
Here, 'xyz' is the NumPy array.
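Here is a minimal end-to-end sketch of that conversion, assuming the same CSV layout as in the question (columns 1-3 hold the x, y and z coordinates):
import numpy as np
import open3d as o3d
import pandas as pd

data = pd.read_csv("./assets/landmarks_frame0.csv", header=None, usecols=range(1, 4))

xyz = data.to_numpy()                        # (N, 3) NumPy array of x, y, z
pcd = o3d.geometry.PointCloud()
pcd.points = o3d.utility.Vector3dVector(xyz)

print(pcd)
o3d.visualization.draw_geometries([pcd])     # the zoom/front/lookat/up kwargs are optional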

Recover column details after PCA and KMeans

I did KMeans clustering after reducing the numerical columns in my DataFrame from 5 to 2 using PCA, and plotted a scatterplot:
pc=PCA(n_components = 2).fit_transform(scaled_df)
scaled_df_PCA= pd.DataFrame(pc, columns=['pca_col1','pca_col2'])
#Then I did the KMeans and its plotting
label_PCA=final_km.fit_predict(scaled_df_PCA)
scaled_df_PCA["label_PCA_df"]=label_PCA
a=scaled_df_PCA[scaled_df_PCA.label_PCA_df==0]
b=scaled_df_PCA[scaled_df_PCA.label_PCA_df==1]
c=scaled_df_PCA[scaled_df_PCA.label_PCA_df==2]
sns.scatterplot(a.pca_col1, a.pca_col2, color="green")
sns.scatterplot(b.pca_col1, b.pca_col2, color="red")
sns.scatterplot(c.pca_col1, c.pca_col2, color="yellow")
I get 3 clusters from the above, based on the 2 columns produced by PCA. Now I wish to get the original columns back for further analysis of those clusters, but I am not able to.
When I use pc.components_ I get the error:
AttributeError Traceback (most recent call last)
/tmp/ipykernel_33/4073743739.py in
----> 1 pc.components_
AttributeError: 'numpy.ndarray' object has no attribute 'components_'
or when I do scaled_df_PCA.components_
AttributeError: 'DataFrame' object has no attribute 'components_'
So I would like to know how to recover the details of the original columns that were reduced during PCA.
This line from your code stores a NumPy array in pc rather than the PCA instance:
pc=PCA(n_components = 2).fit_transform(scaled_df)
An easy fix is to create the PCA instance first and then call fit_transform().
pca = PCA(n_components=2)
df_transformed = pca.fit_transform(scaled_df)
Afterwards, you can still access attributes and methods of the PCA instance, pca.
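For example, a short sketch of what becomes available once the fitted PCA instance is kept (assuming scaled_df is the scaled 5-column DataFrame from the question):
import pandas as pd
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pc = pca.fit_transform(scaled_df)          # scaled_df: the scaled 5-column DataFrame

print(pca.components_)                     # (2, 5) loadings of each original column on the components
print(pca.explained_variance_ratio_)       # variance explained by each component

# Approximate reconstruction of the original (scaled) columns from the 2 components
# (assumes scaled_df is a DataFrame with named columns)
reconstructed = pd.DataFrame(pca.inverse_transform(pc), columns=scaled_df.columns)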

sklearn KMedoids returns empty clusters

I am using KMedoids from sklearn_extra.cluster. I use it with a precomputed distance matrix (metric='precomputed'), and it used to work. However, we found a bug in the way the distance matrix was calculated and therefore had to implement it ourselves. Since then, the KMedoids algorithm no longer works. This is the stack trace:
C:\Users\...\AppData\Local\Programs\Python\Python38-32\lib\site-packages\sklearn_extra\cluster\_k_medoids.py:231: UserWarning: Cluster 1 is empty! self.labels_[self.medoid_indices_[1]] may not be labeled with its corresponding cluster (1).
warnings.warn(
C:\Users\...\AppData\Local\Programs\Python\Python38-32\lib\site-packages\sklearn_extra\cluster\_k_medoids.py:231: UserWarning: Cluster 2 is empty! self.labels_[self.medoid_indices_[2]] may not be labeled with its corresponding cluster (2).
warnings.warn(
C:\Users\...\AppData\Local\Programs\Python\Python38-32\lib\site-packages\sklearn_extra\cluster\_k_medoids.py:231: UserWarning: Cluster 3 is empty! self.labels_[self.medoid_indices_[3]] may not be labeled with its corresponding cluster (3).
warnings.warn(
C:\Users\...\AppData\Local\Programs\Python\Python38-32\lib\site-packages\sklearn_extra\cluster\_k_medoids.py:231: UserWarning: Cluster 4 is empty! self.labels_[self.medoid_indices_[4]] may not be labeled with its corresponding cluster (4).
warnings.warn(
C:\Users\...\AppData\Local\Programs\Python\Python38-32\lib\site-packages\sklearn_extra\cluster\_k_medoids.py:231: UserWarning: Cluster 5 is empty! self.labels_[self.medoid_indices_[5]] may not be labeled with its corresponding cluster (5).
warnings.warn(
C:\Users\...\AppData\Local\Programs\Python\Python38-32\lib\site-packages\sklearn_extra\cluster\_k_medoids.py:231: UserWarning: Cluster 6 is empty! self.labels_[self.medoid_indices_[6]] may not be labeled with its corresponding cluster (6).
warnings.warn(
C:\Users\...\AppData\Local\Programs\Python\Python38-32\lib\site-packages\sklearn_extra\cluster\_k_medoids.py:231: UserWarning: Cluster 7 is empty! self.labels_[self.medoid_indices_[7]] may not be labeled with its corresponding cluster (7).
warnings.warn(
I have checked the distance matrix: it is a two-dimensional NumPy array with dimensions n_data x n_data, and the values on the diagonal are zero, so that should not be the problem. All the values are between 0 and 1. We used to use the Gower distance for this, but for some reason that did not work when we only had categorical data. All our values are booleans. The Gower distance returned the following:
File "C:\Users\...\AppData\Local\Programs\Python\Python38-32\lib\site-packages\gower\gower_dist.py", line 62, in gower_matrix
Z_num = np.divide(Z_num ,num_max,out=np.zeros_like(Z_num), where=num_max!=0)
TypeError: ufunc 'true_divide' output (typecode 'd') could not be coerced to provided output parameter (typecode '?') according to the casting rule ''same_kind''
I also tried pyclustering's KMedoids and that did work. However, with pyclustering you need to define the initial medoids yourself, and the method I found for that did not work with categorical data (see below).
initial_medoids = kmeans_plusplus_initializer(data, n_clus, kmeans_plusplus_initializer.FARTHEST_CENTER_CANDIDATE).initialize(return_index=True)
Stacktrace:
File "path_to_file", line 19, in <module>
initial_medoids = kmeans_plusplus_initializer(data, n_clus, kmeans_plusplus_initializer.FARTHEST_CENTER_CANDIDATE).initialize(return_index=True)
File "path\Python\Python38-32\lib\site-packages\pyclustering\cluster\center_initializer.py", line 357, in initialize
index_point = self.__get_next_center(centers)
File "path\Python\Python38-32\lib\site-packages\pyclustering\cluster\center_initializer.py", line 256, in __get_next_center
distances = self.__calculate_shortest_distances(self.__data, centers)
File "path\Python\Python38-32\lib\site-packages\pyclustering\cluster\center_initializer.py", line 236, in __calculate_shortest_distances
dataset_differences[index_center] = numpy.sum(numpy.square(data - center), axis=1).T
TypeError: numpy boolean subtract, the `-` operator, is not supported, use the bitwise_xor, the `^` operator, or the logical_xor function instead.
My problem can be fixed in three ways, so I hope that someone can help me:
Someone knows why KMedoids by sk-learn doesn't work and can help me with that, so I can use it.
Someone knows what I'm doing wrong with the Gower function from PyPI, so I can use either pyclustering or sklearn.
Someone knows how I can easily find the initial medoids to use for pyclustering, so I can use pyclustering.
I have posted a simple version of the code below.
import pandas as pd
import gower_distance as dist
from sklearn_extra.cluster import KMedoids
data = pd.read_csv(path_to_data)
dist = calcDist(data) # Returns NxN array where N is the amount of data points
# I'm using 8 clusters, which is the default, so I haven't defined it
kmedoids = KMedoids(metric='precomputed').fit(dist)
labels = kmedoids.predict(dist)
I also received that warning (although using the Euclidean distance). Using another initialization of the cluster centers fixed it for me:
kmedoids = KMedoids(metric='precomputed', init='k-medoids++').fit(dist)
To get the cluster labels from the trained model (i.e. the training labels):
data = pd.read_csv(path_to_data)
dist = calcDist(data)
kmedoids = KMedoids(metric='precomputed').fit(dist)
labels = kmedoids.labels_
To use kmedoids.predict on new data with the trained k-medoids model, you need to compute an N x K distance matrix from the N new data points to the K medoids, properly indexed.
medoids = predictData[kmedoids.medoid_indices_, :]
distToMedoids = calcDistToMedoids(predictData, medoids) # with the same metric used in training
predict_labels = kmedoids.predict(distToMedoids)
predict_labels = np.argmin(distToMedoids, axis=1) # what .predict() does
You can find more details in the source code.
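For what it's worth, a minimal sketch of one way to put this together, assuming the gower package from PyPI and sklearn_extra; casting the boolean columns to float before computing the Gower matrix is an assumption intended to sidestep the boolean-dtype errors shown above, not a confirmed fix:
import gower
import pandas as pd
from sklearn_extra.cluster import KMedoids

data = pd.read_csv(path_to_data)

# Assumption: cast boolean columns to float so the distance computation
# does not trip over boolean dtypes
dist_matrix = gower.gower_matrix(data.astype(float))

kmedoids = KMedoids(n_clusters=8, metric='precomputed', init='k-medoids++').fit(dist_matrix)
labels = kmedoids.labels_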

Shogun / quadratic MMD error caused by varying train_test_ratio

I'm using Shogun to run MMD (quadratic) and compare two nonparametric distributions based on their samples (the code below is for 1D, but I've also looked at 2D samples). In the toy problem shown below, I try to change the ratio between training and testing samples in the process of selecting an optimized kernel (KSM_MAXIMIZE_MMD is the selection strategy; I've also used KSM_MEDIAN_HEURISTIC). It appears that any ratio other than 1 yields an error.
Am I allowed to change this ratio in this setting?
(I see that it is used at: http://www.shogun-toolbox.org/examples/latest/examples/statistical_testing/quadratic_time_mmd.html, but it is set to 1 there)
A concise version of my code (inspired by the notebook available at: http://www.shogun-toolbox.org/notebook/latest/mmd_two_sample_testing.html):
import shogun as sg
import numpy as np
from scipy.stats import laplace, norm
n = 220
mu = 0.0
sigma2 = 1
b=np.sqrt(0.5)
X = sg.RealFeatures((norm.rvs(size=n) * np.sqrt(sigma2) + mu).reshape(1,-1))
Y = sg.RealFeatures(laplace.rvs(size=n, loc=mu, scale=b).reshape(1,-1))
mmd = sg.QuadraticTimeMMD(X, Y)
mmd.add_kernel(sg.GaussianKernel(10, 1.0))
mmd.set_kernel_selection_strategy(sg.KSM_MAXIMIZE_MMD)
mmd.set_train_test_mode(True)
mmd.set_train_test_ratio(1)
mmd.select_kernel()
mmd_kernel = sg.GaussianKernel.obtain_from_generic(mmd.get_kernel())
kernel_width = mmd_kernel.get_width()
statistic = mmd.compute_statistic()
p_value = mmd.compute_p_value(statistic)
print p_value
This exact version runs and prints p-values just fine.
If I change the argument passed to mmd.set_train_test_ratio() from 1 to 2, I get:
SystemErrorTraceback (most recent call last)
<ipython-input-30-dd5fcb933287> in <module>()
25 kernel_width = mmd_kernel.get_width()
26
---> 27 statistic = mmd.compute_statistic()
28 p_value = mmd.compute_p_value(statistic)
29
SystemError: [ERROR] In file /feedstock_root/build_artefacts/shogun-cpp_1512688880429/work/shogun-shogun_6.1.3/src/shogun/statistical_testing/internals/mmd/ComputeMMD.h line 90: assertion kernel_matrix.num_rows==size && kernel_matrix.num_cols==size failed in float32_t shogun::internal::mmd::ComputeMMD::operator()(const shogun::SGMatrix<T>&) const [with T = float; float32_t = float] file /feedstock_root/build_artefacts/shogun-cpp_1512688880429/work/shogun-shogun_6.1.3/src/shogun/statistical_testing/internals/mmd/ComputeMMD.h line 90
It gets worse if I use a value below 1. In addition to the following error, the Jupyter notebook kernel crashes every time (after which I need to rerun the entire notebook; the message says: "The kernel appears to have died. It will restart automatically.").
SystemErrorTraceback (most recent call last)
<ipython-input-31-cb4a5224f4ef> in <module>()
20 mmd.set_train_test_ratio(0.5)
21
---> 22 mmd.select_kernel()
23
24 mmd_kernel = sg.GaussianKernel.obtain_from_generic(mmd.get_kernel())
SystemError: [ERROR] In file /feedstock_root/build_artefacts/shogun-cpp_1512688880429/work/shogun-shogun_6.1.3/src/shogun/kernel/Kernel.h line 210: GaussianKernel::kernel(): index out of Range: idx_a=146/146 idx_b=0/146
Complete code (in a jypyter notebook) can be found at: http://nbviewer.jupyter.org/url/dmitry.duplyakin.org/p/jn/kernel-minimal.ipynb
Please let me know if I am missing a step or need to try a different approach.
Side questions:
Both http://www.shogun-toolbox.org/examples/latest/examples/statistical_testing/quadratic_time_mmd.html and http://www.shogun-toolbox.org/notebook/latest/mmd_two_sample_testing.html show examples of using sg.GaussianKernel(10, <width>). I couldn't find more information about the 1st parameter other than its name, cache size. How and when am I supposed to change it?
As mentioned in the referenced notebook, mmd.get_kernel_selection_strategy().get_name() returns only the generic name, specifically KernelSelectionStrategy. How can I obtain a more specific name for the selected strategy (e.g., KSM_MEDIAN_HEURISTIC) from an instance of the sg.QuadraticTimeMMD class?
Any relevant information or references will be greatly appreciated.
Shogun version: v6.1.3_2017-12-7_19:14
The train_test_ratio attribute is the ratio between the number of samples used in training and the number of samples used in testing. When you have train_test_mode turned on, the way it decides how many samples to fetch in each mode goes something like this.
num_training_samples = m_num_samples * train_test_ratio / (train_test_ratio + 1)
num_testing_samples = m_num_samples / (train_test_ratio + 1)
It implicitly assumes divisibility. A train_test_ratio of 2 would therefore try to use 2/3 of the data for training and 1/3 for testing, which is problematic for the total number of samples you have, 220. By this logic it sets num_training_samples = 146 and num_testing_samples = 73, which doesn't add up to 220. Similar issues arise when using 0.5 as the train-test ratio. If you use some other value for train_test_ratio that splits the total number of samples evenly, I think these errors would go away.
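As a quick sanity check, here is a small sketch (an approximation of the splitting logic described above, not Shogun's actual code) showing which ratios split n = 220 cleanly:
n = 220

def split(n, ratio):
    # mirrors the integer splits described above
    num_train = int(n * ratio / (ratio + 1))
    num_test = int(n / (ratio + 1))
    return num_train, num_test

for ratio in (1, 2, 3, 10):
    num_train, num_test = split(n, ratio)
    print(ratio, num_train, num_test, num_train + num_test == n)
# ratio=2 gives 146 + 73 = 219, so the split loses a sample; 1, 3 and 10 split 220 exactly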
I am not totally sure, but I think the cache size matters when you're using SVMLight with Shogun. Please check http://svmlight.joachims.org/ for details. From their page:
-m [5..] - size of cache for kernel evaluations in MB (default 40)
The larger the faster...
There's no pretty-print for the kernel-selection strategy being used, but you can call mmd.get_kernel_selection_strategy().get_method(), which returns the enum value (of type EKernelSelectionMethod); that might be helpful. Since it's not documented yet in the Shogun API docs, here's the C++ equivalent that you might use:
enum EKernelSelectionMethod
{
    KSM_MEDIAN_HEURISTIC,
    KSM_MAXIMIZE_MMD,
    KSM_MAXIMIZE_POWER,
    KSM_CROSS_VALIDATION,
    KSM_AUTO = KSM_MAXIMIZE_POWER
};
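If you want a readable name on the Python side, one option is a small lookup table; this assumes the enum values follow the declaration order above (standard 0-based C++ numbering), which is an assumption rather than documented behaviour:
# Hypothetical helper, not part of the Shogun API: maps get_method()'s integer
# result to the enum names above, assuming 0-based declaration order
KERNEL_SELECTION_METHODS = {
    0: "KSM_MEDIAN_HEURISTIC",
    1: "KSM_MAXIMIZE_MMD",
    2: "KSM_MAXIMIZE_POWER",
    3: "KSM_CROSS_VALIDATION",
}
method = mmd.get_kernel_selection_strategy().get_method()
print(KERNEL_SELECTION_METHODS.get(method, "unknown ({})".format(method)))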
Summary (from comments):
The bug does not show up in the latest code
Solution is in: https://github.com/shogun-toolbox/shogun/pull/4134

sklearn.gaussian_process fit() not working with array sizes greater than 100

I am generating a random.uniform(low=0.0, high=100.0, size=(150,150)) array.
I input this into a function that generates the X, x, and y.
However, if the size of the random test matrix is greater than 100, I get the error below.
I have tried playing around with theta values.
Has anyone had this problem? Is this a bug?
I am using Python 2.6 and scikit-learn 0.10. Should I try Python 3?
Any suggestions or comments are welcome.
Thank you.
gp.fit( XKrn, yKrn )
File "/usr/lib/python2.6/scikit_learn-0.10_git-py2.6-linux-x86_64.egg/sklearn/gaussian_process/gaussian_process.py", line 258, in fit
raise ValueError("X and y must have the same number of rows.")
ValueError: X and y must have the same number of rows.
The error "ValueError: X and y must have the same number of rows." means that, in your case, XKrn.shape[0] must equal yKrn.shape[0]. You probably have an error in the code generating the dataset.
Here is a working example:
In [1]: from sklearn.gaussian_process import GaussianProcess
In [2]: import numpy as np
In [3]: X, y = np.random.randn(150, 10), np.random.randn(150)
In [4]: GaussianProcess().fit(X, y)
Out[4]:
GaussianProcess(beta0=None,
corr=<function squared_exponential at 0x10d42aaa0>, normalize=True,
nugget=array(2.220446049250313e-15), optimizer='fmin_cobyla',
random_start=1,
random_state=<mtrand.RandomState object at 0x10b4c8360>,
regr=<function constant at 0x10d42a488>, storage_mode='full',
theta0=array([[ 0.1]]), thetaL=None, thetaU=None, verbose=False)
Python 3 is not supported yet and the latest released version of scikit-learn is 0.12.1 at this time.
I had the same problem, and the number of rows I was passing in was the same for my X and y.
In my case, the problem was in fact that I was passing multiple output features in y, while this Gaussian process implementation fits a single output feature.
The "number of rows" error was misleading and stemmed from the fact that I wasn't using the package correctly. To fit multiple output features like this, you'll need a separate GP for each feature.
