I want to draw a 3D surface plot using matplotlib. I do not understand why I receive an error saying that x and y must be the same length.
In [134]: dat_vis
Out[134]:
param_C param_gamma mean_test_score x y
4 1 0.001 0.875129 0 1
5 1 0.0001 0.844759 0 0
6 10 0.001 0.903091 0.00900901 1
7 10 0.0001 0.875191 0.00900901 0
8 100 0.001 0.899622 0.0990991 1
9 100 0.0001 0.902420 0.0990991 0
10 1000 0.001 0.909187 1 1
11 1000 0.0001 0.896094 1 0
In [135]: ax.plot_trisurf(dat_vis.x, dat_vis.y, dat_vis.mean_test_score)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-135-1693be3ae757> in <module>()
----> 1 ax.plot_trisurf(dat_vis.x, dat_vis.y, dat_vis.mean_test_score)
~/anaconda3/lib/python3.6/site-packages/mpl_toolkits/mplot3d/axes3d.py in plot_trisurf(self, *args, **kwargs)
1976 lightsource = kwargs.pop('lightsource', None)
1977
-> 1978 tri, args, kwargs = Triangulation.get_from_args_and_kwargs(*args, **kwargs)
1979 if 'Z' in kwargs:
1980 z = np.asarray(kwargs.pop('Z'))
~/anaconda3/lib/python3.6/site-packages/matplotlib/tri/triangulation.py in get_from_args_and_kwargs(*args, **kwargs)
162 mask = kwargs.pop('mask', None)
163
--> 164 triangulation = Triangulation(x, y, triangles, mask)
165 return triangulation, args, kwargs
166
~/anaconda3/lib/python3.6/site-packages/matplotlib/tri/triangulation.py in __init__(self, x, y, triangles, mask)
53 # No triangulation specified, so use matplotlib._qhull to obtain
54 # Delaunay triangulation.
---> 55 self.triangles, self._neighbors = _qhull.delaunay(x, y)
56 self.is_delaunay = True
57 else:
ValueError: x and y must be 1D arrays of the same length
I got the DataFrame from sklearn.model_selection.GridSearchCV(). It returns columns with object dtype, so when I tried to use those columns to draw the graph, the plot failed. So if you cannot find where the problem is, go back and make sure your columns have the right dtypes.
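The dtype check described above can be sketched as follows. This is a minimal illustration, not the asker's actual data pipeline: the frame is a hypothetical stand-in rebuilt from the values printed in the question, forced to object dtype to mimic what came out of the grid search results.

```python
import pandas as pd

# Hypothetical stand-in for the frame built from the GridSearchCV results:
# the numbers are stored as Python objects, so every column has dtype "object".
dat_vis = pd.DataFrame(
    {
        "x": [0, 0, 0.00900901, 0.00900901, 0.0990991, 0.0990991, 1, 1],
        "y": [1, 0, 1, 0, 1, 0, 1, 0],
        "mean_test_score": [0.875129, 0.844759, 0.903091, 0.875191,
                            0.899622, 0.902420, 0.909187, 0.896094],
    },
    dtype=object,
)
print(dat_vis.dtypes)  # every column reports "object"

# Convert each column to a real numeric dtype before plotting.
numeric = dat_vis.apply(pd.to_numeric)
print(numeric.dtypes)
```

With numeric columns, the call from the question would become ax.plot_trisurf(numeric.x, numeric.y, numeric.mean_test_score).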
I have an xarray dataset of sea surface temperature values on an x/y grid. x and y are 1D vector coordinates, so it looks like this minimal example:
<xarray.Dataset>
Dimensions: (x: 10, y: 10)
Coordinates:
* x (x) int64 0 1 2 3 4 5 6 7 8 9
* y (y) int64 0 1 2 3 4 5 6 7 8 9
Data variables:
data (x, y) float64 0.559 0.01037 0.1562 ... 0.08778 0.3272 0.8661
I am able to compute the lat/lon from this x/y grid, and the output is 2 2D arrays. I can add them as coordinates with ds.assign_coords:
<xarray.Dataset>
Dimensions: (x: 10, y: 10)
Coordinates:
* x (x) int64 0 1 2 3 4 5 6 7 8 9
* y (y) int64 0 1 2 3 4 5 6 7 8 9
lat (x, y) float64 30.0 30.0 30.0 30.0 30.0 ... 39.0 39.0 39.0 39.0
lon (x, y) float64 -120.0 -119.0 -118.0 -117.0 ... -113.0 -112.0 -111.0
Data variables:
data (x, y) float64 0.559 0.01037 0.1562 ... 0.08778 0.3272 0.8661
But I'd like to .sel along slices of the lat/lon. This currently isn't possible, as I get the error:
ds.sel(lat=slice(32,36), lon=slice(-118, -115))
ValueError Traceback (most recent call last)
<ipython-input-20-28c79202d5f3> in <module>
----> 1 ds.sel(lat=slice(32,36), lon=slice(-118, -115))
~/.local/lib/python3.8/site-packages/xarray/core/dataset.py in sel(self, indexers, method, tolerance, drop, **indexers_kwargs)
2363 """
2364 indexers = either_dict_or_kwargs(indexers, indexers_kwargs, "sel")
-> 2365 pos_indexers, new_indexes = remap_label_indexers(
2366 self, indexers=indexers, method=method, tolerance=tolerance
2367 )
~/.local/lib/python3.8/site-packages/xarray/core/coordinates.py in remap_label_indexers(obj, indexers, method, tolerance, **indexers_kwargs)
419 }
420
--> 421 pos_indexers, new_indexes = indexing.remap_label_indexers(
422 obj, v_indexers, method=method, tolerance=tolerance
423 )
~/.local/lib/python3.8/site-packages/xarray/core/indexing.py in remap_label_indexers(data_obj, indexers, method, tolerance)
256 new_indexes = {}
257
--> 258 dim_indexers = get_dim_indexers(data_obj, indexers)
259 for dim, label in dim_indexers.items():
260 try:
~/.local/lib/python3.8/site-packages/xarray/core/indexing.py in get_dim_indexers(data_obj, indexers)
222 ]
223 if invalid:
--> 224 raise ValueError(f"dimensions or multi-index levels {invalid!r} do not exist")
225
226 level_indexers = defaultdict(dict)
ValueError: dimensions or multi-index levels ['lat', 'lon'] do not exist
So my question is this: How can I change the dimensions of data to be (lon: (10,10), lat: (10,10)) instead of (x: 10, y: 10)? Is this even possible?
Code to reproduce the example dataset:
import numpy as np
import xarray as xr
# Create sample data
data = np.random.rand(10,10)
x = y = np.arange(10)
# Set up dataset
ds = xr.Dataset(
    data_vars = dict(
        data = (["x", "y"], data)
    ),
    coords= {
        "x" : x,
        "y" : y
    }
)
# Create example lat/lon and assign to dataset
lon, lat = np.meshgrid(np.linspace(-120, -111, 10), np.linspace(30, 39, 10))
ds = ds.assign_coords({
    "lat": (["x", "y"], lat),
    "lon": (["x", "y"], lon)
})
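One possible workaround, offered as a sketch and not as the definitive answer: since lat and lon are 2D coordinates without an index, .sel cannot slice them, but a boolean mask passed to where(..., drop=True) can restrict the dataset to a lat/lon box. The box limits are the ones from the failing .sel call; the names in_box and subset are made up for this illustration.

```python
import numpy as np
import xarray as xr

# Rebuild the example dataset from the question.
x = y = np.arange(10)
lon, lat = np.meshgrid(np.linspace(-120, -111, 10), np.linspace(30, 39, 10))
ds = xr.Dataset(
    data_vars={"data": (["x", "y"], np.random.rand(10, 10))},
    coords={"x": x, "y": y, "lat": (["x", "y"], lat), "lon": (["x", "y"], lon)},
)

# lat/lon are 2D, non-index coordinates, so build a boolean mask instead of using .sel.
in_box = (ds.lat >= 32) & (ds.lat <= 36) & (ds.lon >= -118) & (ds.lon <= -115)

# drop=True trims the x/y labels for which the mask is False everywhere.
subset = ds.where(in_box, drop=True)
print(subset.sizes)
```

On a grid where lat and lon are not aligned with x and y, cells inside the trimmed rectangle but outside the box are kept as NaN, so this is a mask and not a true reindexing by lat/lon.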
I have a pandas dataframe in which I am storing information about different objects in a video.
For each frame of the video I'm saving the positions of the objects in a dataframe with columns 'x', 'y', 'particle' and the frame number in the index:
x y particle
frame
0 588 840 0
0 260 598 1
0 297 1245 2
0 303 409 3
0 307 517 4
This works fine but I want to save information about each frame of the video, e.g. the temperature at each frame.
I'm currently doing this by creating a series with the values for each frame and the index containing the frame number then adding the series to the dataframe.
prop = pd.Series(temperature_values,
                 index=pd.Index(np.arange(len(temperature_values)), name='frame'))
df['temperature'] = prop
This works but produces duplicates of the data in every row of the column:
x y particle temperature
frame
0 588 840 0 12
0 260 598 1 12
0 297 1245 2 12
0 303 409 3 12
0 307 517 4 12
Is there any way of saving this information without duplicates in the current dataframe, so that when I get the temperature column I just receive the original series that I created?
If there isn't any way of doing this, my plan is either to deal with the duplicates using drop_duplicates or to create a second dataframe with just the data for each frame, which I can then merge into my first dataframe, but I'd like to avoid doing this if possible.
Here is the current code with jupyter outputs formatted as best as I can:
import pandas as pd
import numpy as np
frames = list(range(5))
per_frame = []
for f in frames:
    x = np.random.randint(10, 100, size=10)
    y = np.random.randint(10, 100, size=10)
    particle = np.arange(10)
    data = {
        'x': x,
        'y': y,
        'particle': particle,
        'frame': f}
    per_frame.append(pd.DataFrame(data))
# DataFrame.append was removed in pandas 2.0, so concatenate the per-frame tables instead
df = pd.concat(per_frame)
print(df.head())
Output:
x y particle frame
0 61 97 0 0
1 49 73 1 0
2 48 72 2 0
3 59 37 3 0
4 39 64 4 0
Input
df = df.set_index('frame')
print(df.head())
Output
x y particle
frame
0 61 97 0
0 49 73 1
0 48 72 2
0 59 37 3
0 39 64 4
Input:
example_data = [10*f for f in frames]
# Current method
prop = pd.Series(example_data, index=pd.Index(np.arange(len(example_data)), name='frame'))
df['data1'] = prop
print(df.head())
print(df.tail())
Output:
x y particle data1
frame
0 61 97 0 0
0 49 73 1 0
0 48 72 2 0
0 59 37 3 0
0 39 64 4 0
x y particle data1
frame
4 25 93 5 40
4 28 17 6 40
4 39 15 7 40
4 28 47 8 40
4 12 56 9 40
Input:
# Proposed method
df['data2'] = example_data
Output:
ValueError Traceback (most recent call last)
<ipython-input-12-e41b12bbe1cd> in <module>
1 # Proposed method
----> 2 df['data2'] = example_data
~/miniconda3/envs/ParticleTracking/lib/python3.7/site-packages/pandas/core/frame.py in __setitem__(self, key, value)
3368 else:
3369 # set column
-> 3370 self._set_item(key, value)
3371
3372 def _setitem_slice(self, key, value):
~/miniconda3/envs/ParticleTracking/lib/python3.7/site-packages/pandas/core/frame.py in _set_item(self, key, value)
3443
3444 self._ensure_valid_index(value)
-> 3445 value = self._sanitize_column(key, value)
3446 NDFrame._set_item(self, key, value)
3447
~/miniconda3/envs/ParticleTracking/lib/python3.7/site-packages/pandas/core/frame.py in _sanitize_column(self, key, value, broadcast)
3628
3629 # turn me into an ndarray
-> 3630 value = sanitize_index(value, self.index, copy=False)
3631 if not isinstance(value, (np.ndarray, Index)):
3632 if isinstance(value, list) and len(value) > 0:
~/miniconda3/envs/ParticleTracking/lib/python3.7/site-packages/pandas/core/internals/construction.py in sanitize_index(data, index, copy)
517
518 if len(data) != len(index):
--> 519 raise ValueError('Length of values does not match length of index')
520
521 if isinstance(data, ABCIndexClass) and not copy:
ValueError: Length of values does not match length of index
I am afraid you cannot. All columns in a DataFrame share the same index and are required to have the same length. Coming from the database world, I also try to avoid indexes with duplicate values as much as possible.
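Following the database analogy, one way to keep the per-frame values free of duplicates is a second, small table keyed by frame, joined to the particle table only when needed. The sketch below uses made-up numbers, and the names particles, frame_info and merged are hypothetical.

```python
import numpy as np
import pandas as pd

# Per-particle table: one row per (frame, particle), so the index has duplicates.
particles = pd.DataFrame(
    {"x": [588, 260, 297, 303, 307, 310],
     "y": [840, 598, 1245, 409, 517, 520],
     "particle": [0, 1, 0, 1, 0, 1]},
    index=pd.Index(np.repeat(np.arange(3), 2), name="frame"),
)

# Per-frame table: exactly one row per frame, no duplicated values.
frame_info = pd.DataFrame(
    {"temperature": [12.0, 12.5, 13.0]},
    index=pd.Index(np.arange(3), name="frame"),
)

# The original series is still available as-is ...
print(frame_info["temperature"])

# ... and the values are broadcast to particles only when a joined view is needed.
merged = particles.join(frame_info)
print(merged)
```

The duplication then exists only in the temporary merged view, not in the stored data.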
I have a problem with scikit-learn.
I'm doing a really simple linear regression problem. Based on input values of Hours Studied & the resulting grade, I want to be able to estimate a student's grade based on how long they study.
In [1]: import pandas as pd
In [2]: path = 'Desktop/hoursgrades.csv'
In [3]: df = pd.read_csv(path)
In [4]: X = df['Hours Studied']
In [5]: y = df['Grade']
In [6]: training_data_in = list()
In [7]: training_data_out = list()
In [8]: training_data_in.append(X)
In [9]: training_data_out.append(y)
In [11]: from sklearn.linear_model import LinearRegression
In [12]: model = LinearRegression(n_jobs =-1)
In [13]: model.fit(X = training_data_in, y = training_data_out)
Out[13]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=-1, normalize=False)
In this example, the DF looks like this:
In [16]: df
Out[16]:
Hours Studied Grade
0 1 10.0
1 2 20.0
2 3 30.0
3 4 40.0
4 5 50.0
5 6 60.0
6 7 70.0
7 8 80.0
8 9 90.0
9 10 100.0
And X looks like this:
In [17]: X
Out[17]:
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
Name: Hours Studied, dtype: int64
And y looks like this:
In [18]: y
Out[18]:
0 10.0
1 20.0
2 30.0
3 40.0
4 50.0
5 60.0
6 70.0
7 80.0
8 90.0
9 100.0
Name: Grade, dtype: float64
So far so good, it seems to have accepted everything I've put in so far. So now, I want to test the model with some input data. So, I want to say, the number of hours this student studied is 5 & for the model to tell me the expected grade.
But when I put that into the model, I get the below error.
Can anyone advise?
In [14]: studied_hour = [[5]]
In [15]: outcome = model.predict(X = studied_hour)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-15-6fdab4ae2efd> in <module>()
----> 1 outcome = model.predict(X = studied_hour)
~/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/base.py in predict(self, X)
254 Returns predicted values.
255 """
--> 256 return self._decision_function(X)
257
258 _preprocess_data = staticmethod(_preprocess_data)
~/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/base.py in _decision_function(self, X)
239 X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])
240 return safe_sparse_dot(X, self.coef_.T,
--> 241 dense_output=True) + self.intercept_
242
243 def predict(self, X):
~/anaconda3/lib/python3.7/site-packages/sklearn/utils/extmath.py in safe_sparse_dot(a, b, dense_output)
138 return ret
139 else:
--> 140 return np.dot(a, b)
141
142
ValueError: shapes (1,1) and (10,10) not aligned: 1 (dim 1) != 10 (dim 0)
I should add:
In [39]: X.shape
Out[39]: (10,)
In [40]: y.shape
Out[40]: (10,)
The input shapes of X and y are not correct: X has to be (n_samples, n_features) and y has to be (n_samples,), as per the docs.
You see the error because, once each Series is wrapped in a list, the model thinks you have a single sample with ten features and ten different outputs (hence the (10, 10)).
You get the correct results by using
X = df[['Hours Studied']] # note the double brackets, shape (10, 1)
y = df['Grade']
model = LinearRegression().fit(X, y)
model.predict([[5]])
array([50.])
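To make the shape problem visible without fitting anything, the following sketch rebuilds the small table from the question and prints the shapes involved. The variable names wrapped, as_frame and as_column are only illustrative.

```python
import numpy as np
import pandas as pd

# Same values as the DataFrame shown in the question.
df = pd.DataFrame({"Hours Studied": np.arange(1, 11),
                   "Grade": np.arange(10.0, 101.0, 10.0)})
X = df["Hours Studied"]

# Appending the whole Series to a list gives ONE row with ten columns.
wrapped = np.asarray([X])
print(wrapped.shape)

# Two ways to get ten rows with one column each.
as_frame = df[["Hours Studied"]]
as_column = X.to_numpy().reshape(-1, 1)
print(as_frame.shape, as_column.shape)
```

Either of the last two forms has the (n_samples, n_features) layout that fit and predict expect.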
So I'm using scikit-learn to classify some data. I have 13 different class values/categories to classify the data into. I have been able to use cross-validation and print the confusion matrix. However, it only shows the TP and FP etc. without the class labels, so I don't know which class is which. Below is my code and my output:
# Imports assumed from the names used below (the original post did not show them)
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_predict
def classify_data(df, feature_cols, file):
    nbr_folds = 5
    RANDOM_STATE = 0
    attributes = df.loc[:, feature_cols]  # Also known as x
    class_label = df['task']  # Class label, also known as y.
    file.write("\nFeatures used: ")
    for feature in feature_cols:
        file.write(feature + ",")
    print("Features used", feature_cols)
    sampler = RandomOverSampler(random_state=RANDOM_STATE)
    print("RandomForest")
    file.write("\nRandomForest")
    rfc = RandomForestClassifier(max_depth=2, random_state=RANDOM_STATE)
    pipeline = make_pipeline(sampler, rfc)
    class_label_predicted = cross_val_predict(pipeline, attributes, class_label, cv=nbr_folds)
    conf_mat = confusion_matrix(class_label, class_label_predicted)
    print(conf_mat)
    accuracy = accuracy_score(class_label, class_label_predicted)
    print("Rows classified: " + str(len(class_label_predicted)))
    print("Accuracy: {0:.3f}%\n".format(accuracy * 100))
    file.write("\nClassifier settings:" + str(pipeline) + "\n")
    file.write("\nRows classified: " + str(len(class_label_predicted)))
    file.write("\nAccuracy: {0:.3f}%\n".format(accuracy * 100))
    file.writelines('\t'.join(str(j) for j in i) + '\n' for i in conf_mat)
#Output
Rows classified: 23504
Accuracy: 17.925%
0 372 46 88 5 73 0 536 44 317 0 200 127
0 501 29 85 0 136 0 655 9 154 0 172 67
0 97 141 78 1 56 0 336 37 429 0 435 198
0 135 74 416 5 37 0 507 19 323 0 128 164
0 247 72 145 12 64 0 424 21 296 0 304 223
0 190 41 36 0 178 0 984 29 196 0 111 43
0 218 13 71 7 52 0 917 139 177 0 111 103
0 215 30 84 3 71 0 1175 11 55 0 102 62
0 257 55 156 1 13 0 322 184 463 0 197 160
0 188 36 104 2 34 0 313 99 827 0 69 136
0 281 80 111 22 16 0 494 19 261 0 313 211
0 207 66 87 18 58 0 489 23 157 0 464 239
0 113 114 44 6 51 0 389 30 408 0 338 315
As you can see, you can't really know what column is what and the print is also "misaligned" so it's difficult to understand.
Is there a way to print the labels as well?
From the docs, it seems that there is no option to print the row and column labels of the confusion matrix. However, you can specify the label order using the argument labels=...
Example:
from sklearn.metrics import confusion_matrix
y_true = ['yes','yes','yes','no','no','no']
y_pred = ['yes','no','no','no','no','no']
print(confusion_matrix(y_true, y_pred))
# Output:
# [[3 0]
# [2 1]]
print(confusion_matrix(y_true, y_pred, labels=['yes', 'no']))
# Output:
# [[1 2]
# [0 3]]
If you want to print the confusion matrix with labels, you may try pandas and set the index and columns of the DataFrame.
import pandas as pd
cmtx = pd.DataFrame(
confusion_matrix(y_true, y_pred, labels=['yes', 'no']),
index=['true:yes', 'true:no'],
columns=['pred:yes', 'pred:no']
)
print(cmtx)
# Output:
# pred:yes pred:no
# true:yes 1 2
# true:no 0 3
Or
import numpy as np
unique_label = np.unique([y_true, y_pred])
cmtx = pd.DataFrame(
confusion_matrix(y_true, y_pred, labels=unique_label),
index=['true:{:}'.format(x) for x in unique_label],
columns=['pred:{:}'.format(x) for x in unique_label]
)
print(cmtx)
# Output:
# pred:no pred:yes
# true:no 3 0
# true:yes 2 1
It is important to ensure that the way you label your confusion matrix rows and columns corresponds exactly to the way sklearn has coded the classes. The true order of the labels can be read from the .classes_ attribute of a fitted classifier. (cross_val_predict fits clones of the pipeline, so rfc itself has to be fitted before .classes_ exists; np.unique(class_label) gives the same sorted order.) You can use the code below to prepare a confusion matrix data frame.
labels = rfc.classes_
conf_df = pd.DataFrame(confusion_matrix(class_label, class_label_predicted, labels=labels), columns=labels, index=labels)
conf_df.index.name = 'True labels'
The second thing to note is that your classifier is not predicting labels well. The number of correctly predicted labels is shown on the main diagonal of the confusion matrix. You have non-zero values across the matrix, and some classes have not been predicted at all (the columns that are all zero). It might be a good idea to run the classifier with its default parameters and then try to optimise them.
Another, arguably better, way of doing this is to use the crosstab function in pandas.
pd.crosstab(y_true, y_pred, rownames=['True'], colnames=['Predicted'], margins=True)
or, if the labels were encoded (assuming le is the fitted LabelEncoder):
pd.crosstab(le.inverse_transform(y_true),
le.inverse_transform(y_pred),
rownames=['True'],
colnames=['Predicted'],
margins=True)
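A small runnable sketch of the crosstab approach, reusing the yes/no toy labels from the earlier answer. Wrapping the lists in named Series is a choice made here so that the axis titles come from the Series names; the variable name table is illustrative.

```python
import pandas as pd

# Toy labels reused from the yes/no example earlier in the thread.
y_true = pd.Series(['yes', 'yes', 'yes', 'no', 'no', 'no'], name='True')
y_pred = pd.Series(['yes', 'no', 'no', 'no', 'no', 'no'], name='Predicted')

# Rows are true labels, columns are predicted labels; margins=True adds "All" totals.
table = pd.crosstab(y_true, y_pred, margins=True)
print(table)
```

Note that crosstab only lists labels that actually occur, so a class that is never predicted gets no column, unlike confusion_matrix with an explicit labels= list.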
Since the confusion matrix is just a NumPy array, it does not contain any row or column labels. What you can do is convert your matrix into a dataframe and then print this dataframe.
import pandas as pd
import numpy as np
def cm2df(cm, labels):
    rows = {}
    # rows
    for i, row_label in enumerate(labels):
        rowdata = {}
        # columns
        for j, col_label in enumerate(labels):
            rowdata[col_label] = cm[i, j]
        rows[row_label] = rowdata
    # DataFrame.append was removed in pandas 2.0, so build the frame in one step
    return pd.DataFrame.from_dict(rows, orient='index')[labels]
cm = np.arange(9).reshape((3, 3))
df = cm2df(cm, ["a", "b", "c"])
print(df)
Code snippet is adapted from https://gist.github.com/nickynicolson/202fe765c99af49acb20ea9f77b6255e
Output:
a b c
a 0 1 2
b 3 4 5
c 6 7 8
It appears your data has 13 different classes, which is why your confusion matrix has 13 rows and columns. Furthermore, your classes aren't labeled in any way; they are just integers from what I can see.
If this isn't the case, and your training data has actual labels, you can pass a list of unique labels to confusion_matrix (in recent scikit-learn versions labels must be passed by keyword):
conf_mat = confusion_matrix(class_label, class_label_predicted, labels=df['task'].unique())
I'm a bit confused about an error I keep running into. I didn't have it before, but at the same time my data was wrong, so I had to rewrite the code.
Running the following:
plt.figure(figsize=(20,10))
x = np.arange(1416, 1426, 0.009766)
gaverage = np.empty((21,1024), dtype = np.float64)
calibdata = open(pathc + 'calib_5m.dat').readlines()
#print(np.size(calibdata)) ||| Yields: 624
#print(np.size(calibdata)//16) ||| Yields: 39
calib = np.empty(shape=(np.size(calibdata)//16,1024), dtype=np.float64)
for i in range(0, np.size(calibdata)//4):
calib[i] = calibdata[i*4+3].split()
caverage = np.average(calib[i] ,axis = 0)
Yields this:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-25-87f3f4739851> in <module>()
11 calib = np.empty(shape=(np.size(calibdata)//16,1024), dtype=np.float64)
12 for i in range(0, np.size(calibdata)//4):
---> 13 calib[i] = calibdata[i*4+3].split()
14 caverage = np.average(calib[i] ,axis = 0)
15
IndexError: index 39 is out of bounds for axis 0 with size 39
Now what I'm trying to do here is basically take every 4th line in the file read in calibdata and write it to a new array, calib[i]. If the indices are the same size how are they out of bounds? I think there's some fundamentally flawed logic here on my part so if anyone can point out where I'm falling short, that would be great.
calib is initialized with shape (39, n), but the iterator i goes well beyond that:
In [243]: for i in range(np.size(calibdata)//4):
...: print(i, i*4+3)
...:
0 3
1 7
2 11
3 15
4 19
5 23
6 27
7 31
8 35
....
147 591
148 595
149 599
150 603
151 607
152 611
153 615
154 619
155 623
In [244]: calib=np.zeros((np.size(calibdata)//16),int)
In [245]: calib.shape
Out[245]: (39,)
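One way to avoid the mismatch, sketched under assumptions: take every 4th line with a slice and let the number of selected lines determine the array size, instead of preallocating with a separate //16 count. The data below is synthetic (624 lines of 8 numbers each) because the real calib_5m.dat is not available; n_cols and wanted are names made up for this sketch.

```python
import numpy as np

# Synthetic stand-in for the file: 624 lines, 8 numbers per line
# (the real file is assumed to hold 1024 values on the lines of interest).
n_cols = 8
calibdata = [" ".join(str(float(k + c)) for c in range(n_cols)) for k in range(624)]

# Lines 3, 7, 11, ... are the ones the loop in the question reads.
wanted = calibdata[3::4]

# One row per selected line, so the array can never be too small.
calib = np.array([line.split() for line in wanted], dtype=np.float64)
print(calib.shape)

# Average over all selected lines, one value per column.
caverage = calib.mean(axis=0)
print(caverage.shape)
```

If only 39 rows are actually wanted, the slice step would need to change (for example calibdata[3::16]); which of the two is intended is not clear from the question.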