How to write in GrADS-readable binary format using Python?

I have been trying to export data stored as a NumPy array to GrADS flat binary.
GrADS does not seem to recognize the Z dimension given in the .ctl file:
whatever value I use for 'set z <n>', GrADS only shows the first level.
Here is a minimal reproduction of my problem.
My Python code:
import numpy as np
from array import array

data = np.linspace(1, 60, num=60, endpoint=True)
data = np.reshape(data, [5, 4, 3])  # (z, y, x) = (5, 4, 3)
print(data)

with open('temp.dat', 'wb') as wf:  # 'wb' rather than 'ab', so reruns do not append
    float_array = array('f', data.flatten())
    float_array.tofile(wf)
Executing this writes the numbers [1, 2, 3, ..., 60] to a binary file as single-precision floats.
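As a sanity check, the file can be read back with NumPy and contains exactly what I intended to write:
import numpy as np

check = np.fromfile('temp.dat', dtype=np.float32).reshape(5, 4, 3)
print(check[0])  # first level: values 1..12
print(check[2])  # third level: values 25..36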
My .ctl file:
DSET ^temp.dat
TITLE title
UNDEF -9.99E33
XDEF 3 LINEAR 0.0 1
YDEF 4 LINEAR 0.0 1
ZDEF 5 LEVELS 0 1 2 3 4
TDEF 1 LINEAR 0Z10apr1991 12hr
VARS 1
var 0 99 some var
ENDVARS
This pair of .dat and .ctl files shows the first 12 numbers as the first level of the field, as expected.
ga-> open temp.ctl
Scanning description file: temp.ctl
Data file temp.dat is open as file 1
LON set to 0 2
LAT set to 0 3
LEV set to 0 0
Time values set: 1991:4:10:0 1991:4:10:0
E set to 1 1
ga-> set digsize 0.6
digsiz = 0.6
ga-> set lon -1 3
LON set to -1 3
ga-> set lat -1 4
LAT set to -1 4
ga-> set gxout grid
ga-> d var
However, if I 'set z 2', GrADS still shows the first level.
Moreover, with var(z=1) and var(z=3) being
var(z=1)=
[[ 1. 2. 3.]
[ 4. 5. 6.]
[ 7. 8. 9.]
[10. 11. 12.]]
var(z=3)=
[[25. 26. 27.]
[28. 29. 30.]
[31. 32. 33.]
[34. 35. 36.]]
the difference var(z=3) - var(z=1) should be a constant field of 24, but GrADS shows 0, as if z=3 were the same as z=1:
ga-> c
ga-> d var(z=3) - var(z=1)
What is even more baffling is that if I add a second variable var2 to the .ctl file,
GrADS shows what should be var(z=2) as var2(z=1)!
I know there are plenty of visualization tools better than GrADS, but I need to run legacy GrADS scripts, so using it is unavoidable.
Did I write the binary in the wrong order? Or is the binary file missing a header or separator of some kind?
I would be glad if anyone knows why this is happening.
Thanks in advance.

My colleague has just pointed out that the problem is in the .ctl file.
I had set the number of levels for the variable to zero, which GrADS interprets as a surface variable:
a single-level field that can be overlaid on any level above it.
With the corrected .ctl file I can display any level of the field.
My corrected .ctl file:
DSET ^temp.dat
TITLE title
UNDEF -9.99E33
XDEF 3 LINEAR 0.0 1
YDEF 4 LINEAR 0.0 1
ZDEF 5 LEVELS 0 1 2 3 4
TDEF 1 LINEAR 0Z10apr1991 12hr
VARS 1
* this should be 5, not 0, because there are five levels!
* var 0 99 some var
var 5 99 some var
ENDVARS
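For reference, the flat binary itself was fine: GrADS expects longitude (X) to vary fastest, then latitude (Y), then level (Z), then variable, then time, which is exactly what C-order flattening of a NumPy array shaped (t, var, z, y, x) produces. A minimal sketch of the general write pattern, using the shapes from the .ctl above:
import numpy as np

# (nt, nvars, nz, ny, nx) = (1, 1, 5, 4, 3), matching the .ctl above.
field = np.arange(1, 61, dtype=np.float32).reshape(1, 1, 5, 4, 3)

with open('temp.dat', 'wb') as wf:
    # C-order ravel writes x fastest, then y, z, variable, time:
    # the default GrADS flat-binary layout.
    field.ravel().tofile(wf)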

Related

Monte Carlo continuation of a multicolumn pandas timeseries

I have a bunch of data points in a timeseries in a pandas dataframe. Each column is supposedly independent of the others. I want to create a Monte Carlo process to calculate expected values for each of the columns. For that, my expectation is that the underlying data follows a Brownian motion pattern, so I need to generate a normal distribution over the differences between points in time.
I transform my data like this:
diffs = (data.diff() / data.shift(1))
This is what I have at the moment:
data = diffs.describe()
This gives the following output:
A B C
count 4986.000000 4963.000000 1861.000000
mean 0.000285 0.000109 0.000421
std 0.015759 0.015426 0.014676
...
I process it like this to generate more samples:
import numpy as np
desired_samples = 1000
random = np.random.default_rng().normal(
    loc=[data.loc[["mean"]].to_numpy()],
    scale=[data.loc[["std"]].to_numpy()],
    size=[len(data.columns), desired_samples])
However this gives me an error:
ValueError: shape mismatch: objects cannot be broadcast to a single shape. Mismatch is between arg 0 with shape (441, 1000) and arg 1 with shape (1, 1, 441).
What I want is just a matrix of random values whose columns have the same std and mean as the sample's columns, i.e. such that when I do random.describe(), I'd get something like:
A B C
count 1000.0 1000.0 1000.0
mean 0.000285 0.000109 0.000421
std 0.015759 0.015426 0.014676
...
What'd be the correct way to generate those samples?
You could use apply() to create a data frame of random normal values using the attributes of the associated columns.
Generate Test Data
import numpy as np
import pandas as pd

nv = 50
d = {'A': np.random.normal(1, 1, nv),
     'B': np.random.normal(2, 2, nv),
     'C': np.random.normal(3, 3, nv)}
df = pd.DataFrame(d)
print(df)
A B C
0 0.276252 -2.833479 5.746740
1 1.562030 1.497242 2.557416
2 0.883105 -0.861824 3.106192
3 0.352372 0.014653 4.006219
4 1.475524 3.151062 -1.392998
5 2.011649 -2.289844 4.371251
6 3.230964 3.578058 0.610422
7 0.366506 3.391327 0.812932
8 1.669673 -1.021665 4.262500
9 1.835547 4.292063 6.983015
10 1.768208 4.029970 3.971751
...
45 0.501706 0.926860 7.008008
46 1.759266 -0.215047 4.560403
47 1.899167 0.690204 -0.538415
48 1.460267 1.506934 1.306303
49 1.641662 1.066182 0.049233
df.describe()
A B C
count 50.000000 50.000000 50.000000
mean 0.962083 1.522234 2.992492
std 1.073733 1.848754 2.838976
Generate Random Values with Approximately the Same (Calculated) Mean and STD
mat = df.apply(lambda x: np.random.normal(x.mean(),x.std(),100))
print(mat)
A B C
0 0.234955 2.201961 1.910073
1 1.973203 3.528576 5.925673
2 -0.858201 2.234295 1.741338
3 2.245650 2.805498 0.135784
4 1.913691 2.134813 2.246989
.. ... ... ...
95 2.996207 2.248727 2.792658
96 0.663609 4.533541 1.518872
97 0.848259 -0.348086 2.271724
98 3.672370 1.706185 -0.862440
99 0.392051 0.832358 -0.354981
[100 rows x 3 columns]
mat.describe()
A B C
count 100.000000 100.000000 100.000000
mean 0.877725 1.332039 2.673327
std 1.148153 1.749699 2.447532
If you want the result as a NumPy array:
mat.to_numpy()
array([[ 0.78881292, 3.09428714, -1.22757096],
[ 0.13044099, -1.02564025, 2.6566989 ],
[ 0.06090083, 1.50629474, 3.61487469],
[ 0.71418932, 1.88441111, 5.84979454],
[ 2.34287411, 2.58478867, -4.04433653],
[ 1.41846256, 0.36414635, 8.47482082],
[ 0.46765842, 1.37188986, 3.28011085],
[ 0.87433273, 3.45735286, 1.13351138],
[ 1.59029413, 4.0227165 , 3.58282534],
[ 2.23663894, 2.75007385, -0.36242541],
[ 1.80967311, 1.29206572, 1.73277577],
[ 1.20787923, 2.75529187, 4.64721489],
[ 2.33466341, 6.43830387, 4.31354348],
[ 0.87379125, 3.00658046, 4.94270155],
etc ...
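For completeness, the shape-mismatch error in the question can also be fixed in plain NumPy: normal() broadcasts loc and scale against size, so the per-column means and stds just need to be flat arrays matching the last axis of the requested shape. A sketch, assuming diffs is the relative-change frame from the question (pandas' built-in pct_change() computes the same quantity as data.diff() / data.shift(1)):
import numpy as np

desired_samples = 1000

means = diffs.mean().to_numpy()  # shape (n_columns,)
stds = diffs.std().to_numpy()    # shape (n_columns,)

# size=(rows, n_columns) broadcasts against the length-(n_columns) loc and
# scale, giving each output column its own mean and std.
rng = np.random.default_rng()
random = rng.normal(loc=means, scale=stds, size=(desired_samples, len(means)))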

How to calculate a 0-1 certainty score for DecisionTreeClassifier?

Dataset
Columns 0-9: float features (parameters of a product)
Column 10: int labels (products)
Goal
Calculate a 0-1 classification certainty score for the labels (this is what my current code should do)
Calculate the same certainty score for each "product_name" (300 columns) at each of the 22,000 rows
Error
I use sklearn.tree.DecisionTreeClassifier and am trying to use predict_proba, but it gives an error.
Python code
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

data_train = pd.read_csv('data.csv')
features = data_train.columns[:-1]
labels = data_train.columns[-1]
x_features = data_train[features]
x_label = data_train[labels]
X_train, X_test, y_train, y_test = train_test_split(x_features, x_label, random_state=0)
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
clf = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
class_probabilitiesDec = clf.predict_proba(y_train)
# ERROR: ValueError: Number of features of the model must match the input. Model n_features is 10 and input n_features is 16722
print('Decision Tree Classification Accuracy Training Score (max_depth=3): {:.2f}'.format(clf.score(X_train, y_train)*100) + ('%'))
print('Decision Tree Classification Accuracy Test Score (max_depth=3): {:.2f}'.format(clf.score(X_test, y_test)*100) + ('%'))
print(class_probabilitiesDec[:10])
# if I use X_train then it just prints out a bunch of 41-element vectors: [[ 0.00490808 0.00765327 0.01123035 0.00332751 0.00665502 0.00357707
0.05182597 0.03169453 0.04267532 0.02761833 0.01988187 0.01281091
0.02936528 0.03934781 0.02329257 0.02961484 0.0353548 0.02503951
0.03577073 0.04700108 0.07661592 0.04433907 0.03019715 0.02196157
0.0108976 0.0074869 0.0291989 0.03951418 0.01372598 0.0176358
0.02345895 0.0169703 0.02487314 0.01813493 0.0482489 0.01988187
0.03252641 0.01572249 0.01455786 0.00457533 0.00083188]
[....
Features (columns)
(the last column is the labels)
0 1 1 1 1.0 1462293561 1462293561 0 0 0.0 0.0 1
1 2 2 2 8.0 1460211580 1461091152 1 1 0.0 0.0 2
2 3 3 3 1.0 1469869039 1470560880 1 1 0.0 0.0 3
3 4 4 4 1.0 1461482675 1461482675 0 0 0.0 0.0 4
4 5 5 5 5.0 1462173043 1462386863 1 1 0.0 0.0 5
Classes columns (300 columns of items)
Header row: apple gameboy battery ...
Score in 1st row: 0.763 0.346 0.345 ...
Score in 2nd row: 0.256 0.732 0.935 ...
e.g. like the confidence scores given when an image classifier distinguishes cats vs. dogs.
You cannot predict the probability of your labels.
predict_proba predicts the probability for each label from your X data, thus:
class_probabilitiesDec = clf.predict_proba(X_test)
What you posted as "when I use X_train":
[[ 0.00490808 0.00765327 0.01123035 0.00332751 0.00665502 0.00357707
0.05182597 0.03169453 0.04267532 0.02761833 0.01988187 0.01281091
0.02936528 0.03934781 0.02329257 0.02961484 0.0353548 0.02503951
0.03577073 0.04700108 0.07661592 0.04433907 0.03019715 0.02196157
0.0108976 0.0074869 0.0291989 0.03951418 0.01372598 0.0176358
0.02345895 0.0169703 0.02487314 0.01813493 0.0482489 0.01988187
0.03252641 0.01572249 0.01455786 0.00457533 0.00083188]
is a list of the probability of each possible label.
EDIT
After reading your comments, predict_proba is exactly what you want.
Let's make an example. In the following code we have a classifier with 3 classes: either 11, 12 or 13.
If the input is 1, the classifier should predict 11.
If the input is 2, the classifier should predict 12.
...
If the input is 7, the classifier should predict 13.
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
clf.fit([[1], [2], [3], [4], [5], [6], [7]], [11, 12, 13, 13, 12, 11, 13])
Now if you have test data with a single row, e.g. 5, the classifier should predict 12. So let's try that.
clf.predict([[5]])
And voila: the result is array([12])
If we want a probability, then predict_proba is the way to go:
clf.predict_proba([[5]])
and we get array([[0., 1., 0.]])
In that case the array [0., 1., 0.] means :
0% probability for class 11
100% probability for class 12
0% probability for class 13
If I'm correct, that's exactly what you want.
You can even map that to the names of your classes with:
probabilities = clf.predict_proba([[5]])[0]
{clf.classes_[i] : probabilities[i] for i in range(len(probabilities))}
which gives you a dictionary with probabilities for class names:
{11: 0.0, 12: 1.0, 13: 0.0}
Now in your case you have many more classes than just [11, 12, 13], so the array gets longer. And for every row in your dataset predict_proba creates an array, so for more than a single row of data your output becomes a matrix.
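If what you want in the end is a single 0-1 certainty number per row rather than the full distribution, one common convention is the probability of the winning class, i.e. the row-wise maximum of predict_proba. A sketch, assuming the clf and X_test from your code:
import numpy as np

proba = clf.predict_proba(X_test)  # shape (n_rows, n_classes)

certainty = proba.max(axis=1)      # 0-1 score of the predicted class per row
predicted = clf.classes_[np.argmax(proba, axis=1)]  # the matching labels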

Why does scipy's genfromtxt read NaN values as -1?

import scipy as sp
data = sp.genfromtxt(r"C:\Users\DELL INSPIRON N3542\Downloads/1400OS_Code/1400OS_01_Codes/data/web_traffic.tsv" , "\t")
print(data[:24])
[[ 1 2272]
[ 2 -1]
[ 3 1386]
[ 4 1365]
[ 5 1488]
[ 6 1337]
...
and the original Data set looks like that
1 2272
2 nan
3 1386
4 1365
5 1488
6 1337
...
Instead of this -1 I should get a NaN; in the original data set there is a NaN.
You are passing "\t" as a datatype, not the delimiter.
Try instead:
import scipy as sp
data = sp.genfromtxt(r"C:\Users\DELL INSPIRON N3542\Downloads/1400OS_Code/1400OS_01_Codes/data/web_traffic.tsv", delimiter="\t")
I think you get the -1 instead of nan because, by design, integers do not support NaN; only floats do.
I am not sure why passing "\t" as a data type is interpreted as int64 and does not raise an error.
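For completeness, a small self-contained sketch (using numpy directly; in older SciPy versions, scipy.genfromtxt was simply numpy's genfromtxt re-exported) showing that the default float dtype keeps NaN intact, and how to drop those rows afterwards:
import io
import numpy as np

# Inline stand-in for the first rows of web_traffic.tsv.
tsv = io.StringIO("1\t2272\n2\tnan\n3\t1386\n")

data = np.genfromtxt(tsv, delimiter="\t")  # default dtype is float: nan survives

clean = data[~np.isnan(data[:, 1])]        # drop rows with a missing value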

How to save Polys of vtkPolyData to a database and assign them to vtkPoints?

I want to save vtkPolyData to a database. To give me an idea of how to do that, I followed the example below (it creates some points, then exports them to a .vtp file):
#include <vtkVersion.h>
#include <vtkCellArray.h>
#include <vtkPoints.h>
#include <vtkXMLPolyDataWriter.h>
#include <vtkPolyData.h>
#include <vtkSmartPointer.h>

int main ( int, char *[] )
{
  // Create 10 points.
  vtkSmartPointer<vtkPoints> points =
    vtkSmartPointer<vtkPoints>::New();
  for ( unsigned int i = 0; i < 10; ++i )
  {
    points->InsertNextPoint ( i, i, i );
  }

  // Create a polydata object and add the points to it.
  vtkSmartPointer<vtkPolyData> polydata =
    vtkSmartPointer<vtkPolyData>::New();
  polydata->SetPoints(points);

  // Write the file
  vtkSmartPointer<vtkXMLPolyDataWriter> writer =
    vtkSmartPointer<vtkXMLPolyDataWriter>::New();
  writer->SetFileName("test.vtp");
#if VTK_MAJOR_VERSION <= 5
  writer->SetInput(polydata);
#else
  writer->SetInputData(polydata);
#endif

  // Optional - set the mode. The default is binary.
  //writer->SetDataModeToBinary();
  //writer->SetDataModeToAscii();
  writer->Write();

  return EXIT_SUCCESS;
}
Now, exporting the data I am working on, I realized the data is saved in the following fashion:
<?xml version="1.0"?>
<VTKFile type="PolyData" version="0.1" byte_order="LittleEndian" compressor="vtkZLibDataCompressor">
<PolyData>
<Piece NumberOfPoints="290" NumberOfVerts="0" NumberOfLines="0" NumberOfStrips="0" NumberOfPolys="321">
<PointData>
</PointData>
<CellData>
</CellData>
<Points>
<DataArray type="Float32" Name="Points" NumberOfComponents="3" format="ascii" RangeMin="6796534.9032" RangeMax="6805936.2466">
1520 1520 93.9992676 1567 1520 93.9992676
1567 1612 93.9992676 1520 1612 93.9992676
...
</DataArray>
</Points>
<Polys>
<DataArray type="Int32" Name="connectivity" format="ascii" RangeMin="0" RangeMax="29031">
0 1 2 3 1 4
5 2 4 6 7 5
6 8 9 7 8 10
...
</DataArray>
</Polys>
</Piece>
</PolyData>
</VTKFile>
So I thought of creating three tables:
#Object
IdObject idPoints idPolys
1 1 1
Then the following table would have idPoints equal to 1 to relate it to the #Object table:
#Points
Id X Y Z
1 1520 1520 93.9992676
2 1567 1520 93.9992676
3 1567 1612 93.9992676
4 1520 1612 93.9992676
....
However, I do not know how to store the polys, or even how to assign them to those points.
As far as I understand, the polys give a geometry to the points by connecting them, right?
What would be the best way to store the polys, and how can I assign them to vtkPoints stored in a table?
#Polys
Id ???????
1 0 1 2 3 1 4
5 2 4 6 7 5
6 8 9 7 8 10
....
I am not quite sure what you mean by "assigning polys to vtkPoints". The Polys array represents the polygons of the mesh, each of them being a vtkCell, which is basically just a set of point indices defining which points the polygon is made of, plus npts, the number of points in the cell. So you usually assign point indices to polys, not the other way around.
The point indices are what is stored in the connectivity array in your .vtp file. It should also be accompanied by an "offsets" array describing where in the "connectivity" array each individual cell starts, by defining offsets into that array. Theoretically it is not needed if your mesh is made of only one type of poly (e.g. only triangles), but in general that does not have to be true, so there should be an array like:
<DataArray type="Int32" Name="offsets" format="ascii" RangeMin="0" RangeMax="123456">
3 6 10 13 ...
</DataArray>
giving you the first polygon with point indices 0 1 2, the second with 3 1 4, the third with 5 2 4 6 (a quad, which can happen...), the fourth with 7 5 6, etc.
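To make that concrete, here is a small Python sketch that decodes the connectivity and offsets values worked through above into per-cell point-id lists:
connectivity = [0, 1, 2, 3, 1, 4, 5, 2, 4, 6, 7, 5, 6]
offsets = [3, 6, 10, 13]  # end position of each cell within connectivity

cells, start = [], 0
for end in offsets:
    cells.append(connectivity[start:end])
    start = end

print(cells)  # [[0, 1, 2], [3, 1, 4], [5, 2, 4, 6], [7, 5, 6]]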
Anyway, to use this kind of arrangement, your database would then look like:
#Polys
Id pointIDList
0 0 1 2
1 3 1 4
2 5 2 4 6
....
where the indices written in the pointIDList essentially point to your #Points table (assuming you change it to use zero-based indices ;) ). It is a reasonable arrangement, although I do not quite see what benefit you get over storing the .vtp files... but I guess that is your concern.
If by "assigning polys to vtkPoints" you mean that you would like for each point to have a list of polys that are adjacent to it (that point is used to create that polygon), I would still use the #Polys database as desribed above and then added aditional table (or column to the #Points) with adjacency list, keeping indices of polys adjacent to the vertex. You can get that list from vtkPolyData by calling getPointCells(pointID, pointerToTheListToFill).

Python: Sending discontinuous data with mpi4py

I have a C-ordered matrix of dimensions (N,M)
mat = np.random.randn(N, M)
of which I want to send a column through a persistent MPI request to another node. However, using mpi4py,
sreq = MPI.COMM_WORLD.Send_init((mat[:, idx], MPI.DOUBLE), send_id, tag)
fails on account of the slice being non-contiguous. Can someone suggest a way of going about this? I believe that in C, MPI_Type_vector allows one to specify a stride when creating a type. How can I accomplish this with mpi4py?
Create a send buffer!
Look at this example:
#!/usr/bin/python2
# -*- coding: utf-8 -*-

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

matrix = np.empty((5, 10), dtype='f')
for y in xrange(len(matrix)):
    for x in xrange(len(matrix[0])):
        matrix[y, x] = rank * 10 + x * y

sendbuf = np.empty(5, dtype='f')

# column 1
sendbuf[:] = matrix[:, 1]

result = comm.gather(sendbuf, root=0)

if rank == 0:
    for res in result:
        print res
this will give you:
$ mpirun -np 4 column.py
[ 0. 1. 2. 3. 4.]
[ 10. 11. 12. 13. 14.]
[ 20. 21. 22. 23. 24.]
[ 30. 31. 32. 33. 34.]
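To address the MPI_Type_vector part of the question directly: mpi4py exposes it as Datatype.Create_vector, so you can describe a strided column and hand it to a persistent request without copying. A minimal, untested sketch (N, M, idx and tag stand in for the question's variables, and ranks 0 and 1 replace send_id):
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

N, M = 5, 10       # matrix shape
idx, tag = 1, 0    # column to send

mat = np.arange(N * M, dtype='d').reshape(N, M)

# One column of a C-ordered (N, M) double matrix: N blocks of 1 element,
# consecutive blocks M elements apart.
column_t = MPI.DOUBLE.Create_vector(N, 1, M).Commit()

if rank == 0:
    flat = mat.reshape(-1)  # contiguous view of the whole matrix
    # Start the buffer at the first element of column idx.
    sreq = comm.Send_init([flat[idx:], 1, column_t], dest=1, tag=tag)
    sreq.Start()
    sreq.Wait()
elif rank == 1:
    col = np.empty(N, dtype='d')
    comm.Recv(col, source=0, tag=tag)
    print(col)  # rank 0's column idx

column_t.Free()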
