I have an xarray dataset of sea surface temperature values on an x/y grid. x and y are 1D vector coordinates, so it looks like this minimal example:
<xarray.Dataset>
Dimensions: (x: 10, y: 10)
Coordinates:
* x (x) int64 0 1 2 3 4 5 6 7 8 9
* y (y) int64 0 1 2 3 4 5 6 7 8 9
Data variables:
data (x, y) float64 0.559 0.01037 0.1562 ... 0.08778 0.3272 0.8661
I am able to compute the lat/lon from this x/y grid, and the output is two 2D arrays. I can add them as coordinates with ds.assign_coords:
<xarray.Dataset>
Dimensions: (x: 10, y: 10)
Coordinates:
* x (x) int64 0 1 2 3 4 5 6 7 8 9
* y (y) int64 0 1 2 3 4 5 6 7 8 9
lat (x, y) float64 30.0 30.0 30.0 30.0 30.0 ... 39.0 39.0 39.0 39.0
lon (x, y) float64 -120.0 -119.0 -118.0 -117.0 ... -113.0 -112.0 -111.0
Data variables:
data (x, y) float64 0.559 0.01037 0.1562 ... 0.08778 0.3272 0.8661
But I'd like to .sel along slices of the lat/lon. This currently isn't possible, as I get the error:
ds.sel(lat=slice(32,36), lon=slice(-118, -115))
ValueError Traceback (most recent call last)
<ipython-input-20-28c79202d5f3> in <module>
----> 1 ds.sel(lat=slice(32,36), lon=slice(-118, -115))
~/.local/lib/python3.8/site-packages/xarray/core/dataset.py in sel(self, indexers, method, tolerance, drop, **indexers_kwargs)
2363 """
2364 indexers = either_dict_or_kwargs(indexers, indexers_kwargs, "sel")
-> 2365 pos_indexers, new_indexes = remap_label_indexers(
2366 self, indexers=indexers, method=method, tolerance=tolerance
2367 )
~/.local/lib/python3.8/site-packages/xarray/core/coordinates.py in remap_label_indexers(obj, indexers, method, tolerance, **indexers_kwargs)
419 }
420
--> 421 pos_indexers, new_indexes = indexing.remap_label_indexers(
422 obj, v_indexers, method=method, tolerance=tolerance
423 )
~/.local/lib/python3.8/site-packages/xarray/core/indexing.py in remap_label_indexers(data_obj, indexers, method, tolerance)
256 new_indexes = {}
257
--> 258 dim_indexers = get_dim_indexers(data_obj, indexers)
259 for dim, label in dim_indexers.items():
260 try:
~/.local/lib/python3.8/site-packages/xarray/core/indexing.py in get_dim_indexers(data_obj, indexers)
222 ]
223 if invalid:
--> 224 raise ValueError(f"dimensions or multi-index levels {invalid!r} do not exist")
225
226 level_indexers = defaultdict(dict)
ValueError: dimensions or multi-index levels ['lat', 'lon'] do not exist
So my question is this: How can I change the dimensions of data to be (lon: (10,10), lat: (10,10)) instead of (x: 10, y: 10)? Is this even possible?
Code to reproduce the example dataset:
import numpy as np
import xarray as xr
# Create sample data
data = np.random.rand(10,10)
x = y = np.arange(10)
# Set up dataset
ds = xr.Dataset(
data_vars = dict(
data = (["x", "y"], data)
),
coords= {
"x" : x,
"y" : y
}
)
# Create example lat/lon and assign to dataset
lon, lat = np.meshgrid(np.linspace(-120, -111, 10), np.linspace(30, 39, 10))
ds = ds.assign_coords({
"lat": (["x", "y"], lat),
"lon": (["x", "y"], lon)
})
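One workaround I've seen suggested, since .sel only works on dimension coordinates, is boolean masking with .where; a minimal sketch against the example dataset above, using the same bounds as the failing .sel call:
subset = ds.where(
    (ds.lat >= 32) & (ds.lat <= 36) &
    (ds.lon >= -118) & (ds.lon <= -115),
    drop=True,
)
With drop=True, the x and y dimensions are trimmed to the bounding box of the mask (points inside the box but outside the mask become NaN), which is close to, but not exactly, a label-based .sel.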
Related
I have an xarray.Dataset that looks roughly like this:
<xarray.Dataset>
Dimensions: (index: 286720)
Coordinates:
* index (index) int64 0 1 2 3 4 ... 286716 286717 286718 286719
Data variables:
Time (index) float64 2.525 2.525 2.525 ... 9.475 9.475 9.475
ch (index) int64 1 1 1 1 1 1 1 1 1 1 ... 2 2 2 2 2 2 2 2 2 2
pixel (index) int64 1 2 3 4 5 6 ... 1020 1021 1022 1023 1024
Rough_wavelength (index) float64 2.698 2.701 2.704 ... 32.05 32.05 32.06
Count (index) int64 463 197 265 335 305 ... 285 376 278 0 278
There are only 140 unique values for the Time variable, 2 for the ch(...annel), and 1024 for the pixel value. I'd thus like to turn them into coordinates and completely drop the largely irrelevant index coordinate, something like this:
<xarray.Dataset>
Dimensions: (Time: 140, ch: 2, pixel: 1024)
Coordinates:
Time (Time) float64 2.525 ... 9.475
ch (ch) int64 1 2
pixel (pixel) int64 1 2 3 4 5 6 ... 1020 1021 1022 1023 1024
Data variables:
Rough_wavelength (Time, ch, pixel) float64 2.698 ... 32.06
Count (Time, ch, pixel) int64 463 ... 278
Is there a way to do this using xarray? If not, what's a sane way to do this using the standard numpy stack?
Replace the index coordinate with a pd.MultiIndex, then unstack the index:
In [10]: ds.assign_coords(
...: {
...: "index": pd.MultiIndex.from_arrays(
...: [ds.Time.values, ds.ch.values, ds.pixel.values],
...: names=["Time", "ch", "pixel"],
...: )
...: }
...: ).drop_vars(["Time", "ch", "pixel"]).unstack("index")
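An alternative sketch, assuming a reasonably recent xarray: promote the three variables to coordinates and let set_index build the MultiIndex directly, which avoids constructing pd.MultiIndex.from_arrays and dropping the variables by hand:
In [11]: ds.set_coords(["Time", "ch", "pixel"]).set_index(
    ...:     index=["Time", "ch", "pixel"]
    ...: ).unstack("index")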
I need to transform float values to int. However, I would like to not lose any information while converting. The values (from a dataframe column used as y when building a model) are as follows:
-1.0
0.0
9.0
-0.5
1.5
1.5
...
If I convert them to int directly, -0.5 might become 0 or -1, so I would lose information.
I need to convert the values above to int because I need to pass them to fit a model, model.fit(X, y). Is there a format that would allow me to pass these values to the fit function (the column above is meant to be the y column)?
Code:
import numpy as np
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.semi_supervised import LabelSpreading

le = LabelEncoder()
X = df[['Col1', 'Col2']].apply(le.fit_transform)
X_transformed = np.concatenate((X[['Col1']], X[['Col2']]), axis=1)
y = df['Label'].values
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X_transformed)
model_LS = LabelSpreading(kernel='knn',
                          gamma=70,
                          alpha=0.5,
                          max_iter=30,
                          tol=0.001,
                          n_jobs=-1,
                          )
LS = model_LS.fit(X_scaled, y)
Data:
Col1 Col2 Label
Cust1 Cust2 1.0
Cust1 Cust4 1.0
Cust4 Cust5 -1.5
Cust12 Cust6 9.0
The error that I get when running the above code is:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-174-14429cc07d75> in <module>
2
----> 3 LS=model_LS.fit(X_scaled, y)
~/opt/anaconda3/lib/python3.8/site-packages/sklearn/semi_supervised/_label_propagation.py in fit(self, X, y)
228 X, y = self._validate_data(X, y)
229 self.X_ = X
--> 230 check_classification_targets(y)
231
232 # actual graph construction (implementations should override this)
~/opt/anaconda3/lib/python3.8/site-packages/sklearn/utils/multiclass.py in check_classification_targets(y)
181 if y_type not in ['binary', 'multiclass', 'multiclass-multioutput',
182 'multilabel-indicator', 'multilabel-sequences']:
--> 183 raise ValueError("Unknown label type: %r" % y_type)
184
185
ValueError: Unknown label type: 'continuous'
You can multiply your values to remove the decimal part:
import pandas as pd

df = pd.DataFrame({'Label': [1.0, -1.3, 0.75, 9.0, 7.8236]})
decimals = df['Label'].astype(str).str.split('.').str[1].str.len().max()
df['y'] = df['Label'].mul(float(f"1e{decimals}")).astype(int)
print(df)
# Output:
Label y
0 1.0000 10000
1 -1.3000 -13000
2 0.7500 7500
3 9.0000 90000
4 7.8236 78236
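Note that the absolute scale of the resulting integers is irrelevant here: LabelSpreading treats y as categorical class labels, so all that matters is that distinct float values map to distinct integers. (One caveat: in scikit-learn's semi-supervised estimators, the integer -1 is reserved to mark unlabeled samples, so the encoding should avoid producing -1 for a real class.)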
I think you need:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame(data={'y': [-1.0, 0.0, 9.0, -0.5, 1.5, 1.5]})
le = LabelEncoder()
le.fit(df['y'])
df['y'] = le.transform(df['y'])
print(df)
OUTPUT
y
0 0
1 2
2 4
3 1
4 3
5 3
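Since LabelEncoder keeps the fitted mapping, no information is lost: the original float labels can be recovered at any point with inverse_transform:
print(le.inverse_transform(df['y']))
# [-1.   0.   9.  -0.5  1.5  1.5]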
I have two dataframes.
The first dataframe (map) consists of two columns, "X" and "Y"; map has 83150 rows.
The second dataframe (coords) consists of two columns, "X Rotate" and "Y Rotate"; coords has 2702 rows.
The objective is to find, for each (X, Y) coordinate within map, the nearest neighbor among the (X Rotate, Y Rotate) coordinates within coords.
In order to do this, I duplicate each row within coords 31 times, because ceil(83150 / 2702) = 31. Now coords has 83762 rows. This means each (X, Y) coordinate can find its nearest neighbor among the (X Rotate, Y Rotate) points, and there will be 612 duplicated coordinates within coords left without a nearest-neighbor match.
This is the function to make this happen:
import math
import numpy as np
import pandas as pd
from scipy import spatial

def nearest_neighbors(df, map):
    # Duplicate each coords row enough times to cover every map point
    num_pts = math.ceil(map.shape[0] / df.shape[0])
    map = map[["X", "Y"]].to_numpy()
    duplicate_cords_df = pd.DataFrame(np.repeat(df.values, num_pts, axis=0), columns=df.columns)
    duplicate_cords_sub = duplicate_cords_df[["X Rotate", "Y Rotate"]].to_numpy()
    # Build the tree once; it does not change between queries
    map_tree = spatial.cKDTree(duplicate_cords_sub)
    cols = ["Map X", "Map Y", "X Rotate", "Y Rotate", "Distance"]
    list_of_dicts = []
    for row in map:
        # query returns the single nearest duplicated coords point to this map point
        distance, index = map_tree.query(row)
        map_x, map_y = row
        coords_x, coords_y = duplicate_cords_sub[index]
        list_of_dicts.append(dict(zip(cols, [map_x, map_y, coords_x, coords_y, distance])))
    results_df = pd.DataFrame(list_of_dicts)
    return results_df
However, when I check the duplicate counts in results_df, I notice that each (X Rotate, Y Rotate) coordinate is used a varying number of times.
overall_df_dup = results_df.groupby(['X Rotate', 'Y Rotate']).size().reset_index(name='count')
print(overall_df_dup)
X Rotate Y Rotate count
0 -74.25 0.00 16
1 -72.48 -12.37 28
2 -72.48 -8.84 37
3 -72.48 -5.30 43
4 -72.48 -1.77 39
... ... ... ...
2697 70.71 14.14 62
2698 72.48 -8.84 45
2699 72.48 -1.77 55
2700 72.48 1.77 47
2701 72.48 5.30 48
I checked the duplicate counts of the dataframe given to the KDTree function, and they were correct:
coords_dup = duplicate_cords_df.groupby(['X Rotate', 'Y Rotate']).size().reset_index(name='count')
print(coords_dup)
X Rotate Y Rotate count
0 -74.25 -0.00 31
1 -72.48 -12.37 31
2 -72.48 -8.84 31
3 -72.48 -5.30 31
4 -72.48 -1.77 31
... ... ... ...
2697 70.71 14.14 31
2698 72.48 -8.84 31
2699 72.48 -1.77 31
2700 72.48 1.77 31
2701 72.48 5.30 31
How can the resulting df contain more duplicates of a coordinate than exist in the dataframe fed into the KDTree function?
Bonus question: Is it possible to have most (X Rotate, Y Rotate) coordinates mapped 30 times and only some mapped 31 times? Ideally, I want each (X Rotate, Y Rotate) coordinate to be mapped exactly 30 times regardless.
First, the reason for the varying counts: duplicating rows in coords does not ration the matches. Each map point independently queries its single nearest neighbour, and identical duplicated points are merely ties, so a popular coordinate can be returned far more than 31 times while others are never returned at all. This is probably not the answer intended, but the following may help, using KDTree with the query inverted so that the number of matches per coordinate is fixed.
Create a minimal reproducible example:
import pandas as pd
import numpy as np
from scipy.spatial import cKDTree
gen_coords = lambda s: np.round(np.random.randint(-100, 100, s) \
+ np.random.random(s), 2)
df_map = pd.DataFrame(gen_coords((83150, 2)), columns=['X', 'Y'])
df_coords = pd.DataFrame(gen_coords((2702, 2)), columns=['X Rotate', 'Y Rotate'])
Map coordinates:
df_coords['IDX'] = cKDTree(df_map).query(df_coords, k=30)[1].tolist()
df_coords = df_coords.explode('IDX')
df_coords[['X', 'Y']] = df_map.loc[df_coords['IDX'].tolist()].values
df_coords = df_coords.drop(columns='IDX')
Output result:
>>> df_coords
X Rotate Y Rotate X Y
0 99.00 57.35 99.18 57.13
0 99.00 57.35 98.54 57.53
0 99.00 57.35 99.14 58.20
0 99.00 57.35 99.88 57.36
0 99.00 57.35 98.03 56.94
... ... ... ... ...
2701 92.75 -8.69 91.40 -9.74
2701 92.75 -8.69 91.75 -7.29
2701 92.75 -8.69 93.41 -7.09
2701 92.75 -8.69 94.48 -8.78
2701 92.75 -8.69 93.29 -10.36
[81060 rows x 4 columns]
>>> df_coords.value_counts(['X Rotate', 'Y Rotate'])
X Rotate Y Rotate
-99.71 -20.20 30
35.72 85.56 30
34.64 76.37 30
34.76 8.32 30
34.90 -4.75 30
..
-32.69 -44.76 30
-32.66 72.96 30
-32.63 -40.65 30
-32.61 34.91 30
99.89 98.02 30
Length: 2702, dtype: int64
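One design trade-off of this approach is worth noting: every (X Rotate, Y Rotate) point now appears exactly 30 times (2702 × 30 = 81060 rows), but nothing constrains the map side, so a map row may be matched by several coords points or by none. A quick check of that, assuming the frames from the example above:
_, idx = cKDTree(df_map).query(df_coords[['X Rotate', 'Y Rotate']].drop_duplicates(), k=30)
print(np.unique(idx).size, "of", len(df_map), "map rows are used at least once")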
I want to draw a 3D surface plot using matplotlib. I do not understand why I receive an error indicating that x and y must be the same length.
In [134]: dat_vis
Out[134]:
param_C param_gamma mean_test_score x y
4 1 0.001 0.875129 0 1
5 1 0.0001 0.844759 0 0
6 10 0.001 0.903091 0.00900901 1
7 10 0.0001 0.875191 0.00900901 0
8 100 0.001 0.899622 0.0990991 1
9 100 0.0001 0.902420 0.0990991 0
10 1000 0.001 0.909187 1 1
11 1000 0.0001 0.896094 1 0
In [135]: ax.plot_trisurf(dat_vis.x, dat_vis.y, dat_vis.mean_test_score)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-135-1693be3ae757> in <module>()
----> 1 ax.plot_trisurf(dat_vis.x, dat_vis.y, dat_vis.mean_test_score)
~/anaconda3/lib/python3.6/site-packages/mpl_toolkits/mplot3d/axes3d.py in plot_trisurf(self, *args, **kwargs)
1976 lightsource = kwargs.pop('lightsource', None)
1977
-> 1978 tri, args, kwargs = Triangulation.get_from_args_and_kwargs(*args, **kwargs)
1979 if 'Z' in kwargs:
1980 z = np.asarray(kwargs.pop('Z'))
~/anaconda3/lib/python3.6/site-packages/matplotlib/tri/triangulation.py in get_from_args_and_kwargs(*args, **kwargs)
162 mask = kwargs.pop('mask', None)
163
--> 164 triangulation = Triangulation(x, y, triangles, mask)
165 return triangulation, args, kwargs
166
~/anaconda3/lib/python3.6/site-packages/matplotlib/tri/triangulation.py in __init__(self, x, y, triangles, mask)
53 # No triangulation specified, so use matplotlib._qhull to obtain
54 # Delaunay triangulation.
---> 55 self.triangles, self._neighbors = _qhull.delaunay(x, y)
56 self.is_delaunay = True
57 else:
ValueError: x and y must be 1D arrays of the same length
The DataFrame object comes from sklearn.model_selection.GridSearchCV(), which returns columns with object dtype, so when I tried to use them to draw the graph they could not be handled properly. If you hit this error and cannot find where the problem is, go back and make sure your columns have the right dtypes.
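A minimal sketch of the fix, assuming the x, y, and mean_test_score columns hold numeric strings: cast them to float (e.g. with astype, or pd.to_numeric) before calling plot_trisurf:
dat_vis = dat_vis.astype({'x': float, 'y': float, 'mean_test_score': float})
ax.plot_trisurf(dat_vis.x, dat_vis.y, dat_vis.mean_test_score)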
I have a problem with SciKit Learn.
I'm doing a really simple linear regression problem. Based on input values of Hours Studied and the resulting Grade, I want to be able to estimate a student's grade based on how long they study.
In [1]: import pandas as pd
In [2]: path = 'Desktop/hoursgrades.csv'
In [3]: df = pd.read_csv(path)
In [4]: X = df['Hours Studied']
In [5]: y = df['Grade']
In [6]: training_data_in = list()
In [7]: training_data_out = list()
In [8]: training_data_in.append(X)
In [9]: training_data_out.append(y)
In [11]: from sklearn.linear_model import LinearRegression
In [12]: model = LinearRegression(n_jobs =-1)
In [13]: model.fit(X = training_data_in, y = training_data_out)
Out[13]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=-1, normalize=False)
In this example, the DF looks like this:
In [16]: df
Out[16]:
Hours Studied Grade
0 1 10.0
1 2 20.0
2 3 30.0
3 4 40.0
4 5 50.0
5 6 60.0
6 7 70.0
7 8 80.0
8 9 90.0
9 10 100.0
And X looks like this:
In [17]: X
Out[17]:
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
Name: Hours Studied, dtype: int64
And y looks like this:
In [18]: y
Out[18]:
0 10.0
1 20.0
2 30.0
3 40.0
4 50.0
5 60.0
6 70.0
7 80.0
8 90.0
9 100.0
Name: Grade, dtype: float64
So far so good; it seems to have accepted everything I've put in. Now I want to test the model with some input data: I want to say the number of hours this student studied is 5, and have the model tell me the expected grade.
But when I put that into the model, I get the below error.
Can anyone advise?
In [14]: studied_hour = [[5]]
In [15]: outcome = model.predict(X = studied_hour)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-15-6fdab4ae2efd> in <module>()
----> 1 outcome = model.predict(X = studied_hour)
~/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/base.py in predict(self, X)
254 Returns predicted values.
255 """
--> 256 return self._decision_function(X)
257
258 _preprocess_data = staticmethod(_preprocess_data)
~/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/base.py in _decision_function(self, X)
239 X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])
240 return safe_sparse_dot(X, self.coef_.T,
--> 241 dense_output=True) + self.intercept_
242
243 def predict(self, X):
~/anaconda3/lib/python3.7/site-packages/sklearn/utils/extmath.py in safe_sparse_dot(a, b, dense_output)
138 return ret
139 else:
--> 140 return np.dot(a, b)
141
142
ValueError: shapes (1,1) and (10,10) not aligned: 1 (dim 1) != 10 (dim 0)
I should add:
In [39]: X.shape
Out[39]: (10,)
In [40]: y.shape
Out[40]: (10,)
The input shapes of X and y are not correct: per the docs, X must be (n_samples, n_features) and y must be (n_samples,).
You see the error because appending each Series to a list made the model treat it as a single sample with ten features and ten different outputs, hence the (10, 10) coefficient matrix.
You get the correct results by using:
X = df[['Hours Studied']] # note the double brackets, shape (10, 1)
y = df['Grade']
model = LinearRegression().fit(X, y)
model.predict([[5]])
array([50.])
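Equivalently, if you prefer to keep the Series, you can reshape its underlying array into a single-feature column instead of re-selecting from the dataframe:
X = df['Hours Studied'].values.reshape(-1, 1)  # shape (10, 1)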