Python: Sending non-contiguous data with mpi4py

I have a C-ordered matrix of dimensions (N,M)
mat = np.random.randn(N, M)
of which I want to send a column through a persistent MPI request to another node. However, using mpi4py,
sreq = MPI.COMM_WORLD.Send_init((mat[:, idx], MPI.DOUBLE), send_id, tag)
fails because the slice is non-contiguous. Can someone suggest a way of going about this? I believe that in C, MPI_Type_vector lets one specify a stride when creating a type. How can I accomplish this with mpi4py?

Create a send buffer and copy the column into it. Look at this example:
#!/usr/bin/python2
# -*- coding: utf-8 -*-

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

matrix = np.empty((5, 10), dtype='f')
for y in xrange(len(matrix)):
    for x in xrange(len(matrix[0])):
        matrix[y, x] = rank * 10 + x * y

sendbuf = np.empty(5, dtype='f')

# column 1
sendbuf[:] = matrix[:, 1]

result = comm.gather(sendbuf, root=0)

if rank == 0:
    for res in result:
        print res
This will give you:
$ mpirun -np 4 column.py
[ 0. 1. 2. 3. 4.]
[ 10. 11. 12. 13. 14.]
[ 20. 21. 22. 23. 24.]
[ 30. 31. 32. 33. 34.]
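
To answer the MPI_Type_vector part of the question directly: mpi4py exposes derived datatypes through MPI.Datatype.Create_vector, so the column can also be sent in place, without copying. A minimal sketch (assuming a C-ordered float64 matrix; mat.ravel()[idx:] is just a contiguous view starting at the column's first element):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

N, M, idx = 5, 10, 1
mat = np.arange(N * M, dtype='d').reshape(N, M)

# one column of a C-ordered (N, M) matrix: N blocks of 1 double,
# strided M doubles apart
col_type = MPI.DOUBLE.Create_vector(N, 1, M)
col_type.Commit()

if rank == 0:
    # persistent send request, as in the question
    sreq = comm.Send_init([mat.ravel()[idx:], 1, col_type], dest=1, tag=0)
    sreq.Start()
    sreq.Wait()
elif rank == 1:
    col = np.empty(N, dtype='d')  # the received column is contiguous
    comm.Recv(col, source=0, tag=0)
    print(col)

col_type.Free()

If a copy is acceptable, np.ascontiguousarray(mat[:, idx]) is the simpler route, which is essentially what the send-buffer answer above does.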

Related

How do I control the magnitude at which I shuffle my dataset

I have a dataset X where each data point (each row) is in a particular order.
To totally shuffle X, I use something like this:
shufX = torch.randperm(len(X))
X = X[shufX]
Say I just want to mildly shuffle (maybe shift the positions of a few data points) without totally shuffling. I would like a parameter p such that when p=0 it does not shuffle, and when p=1 it totally shuffles like the code above. This way, I can adjust the amount of shuffling to be mild or more extensive.
I attempted this but realized it could result in duplicate data points, which is not what I want.
p = 0.1
mask = torch.bernoulli(p * torch.ones(len(X))).bool()
shufX = torch.randperm(len(X))
X1 = X[shufX]
C = torch.where(mask, X, X1)
Create a shuffle function which only swaps a limited number of items.
import numpy as np
from random import randrange, seed

def shuffle(arr_in, weight=1.0):
    count = len(arr_in)
    n = int(count * weight)  # set the number of swap iterations
    for ix in range(n):
        ix0 = randrange(count)
        ix1 = randrange(count)
        # swap the items at the two chosen indices
        arr_in[ix0], arr_in[ix1] = arr_in[ix1], arr_in[ix0]

seed(1234)
arr = np.arange(50)
shuffle(arr, 0.25)
print(arr)
# [ 7 15 42 3 4 44 28 0 8 29 10 11 12 13 14 22 16 17 18 19 20 21
# 1 23 24 25 26 27 49 9 41 31 32 33 34 35 36 5 38 30 40 39 2 43
# 37 45 46 47 48 6]
Even with a weight of 1.0, some of the items (on average) won't be moved. You can play with the parameters of the function to get the behaviour you need.
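
For the PyTorch case in the question, the same swap idea can be written so that the asker's p parameter is kept and no duplicates can appear: select a random subset of rows and permute them among themselves. A sketch (partial_shuffle is a made-up name):

import torch

def partial_shuffle(X, p):
    # each row is selected for shuffling with probability p
    mask = torch.bernoulli(p * torch.ones(len(X))).bool()
    idx = torch.arange(len(X))
    chosen = idx[mask]
    # permute the chosen rows among themselves; every row still appears
    # exactly once, so no duplicates are introduced
    idx[mask] = chosen[torch.randperm(len(chosen))]
    return X[idx]

X = torch.arange(10).unsqueeze(1)          # toy data, one row per point
print(partial_shuffle(X, 0.0).squeeze())   # unchanged
print(partial_shuffle(X, 1.0).squeeze())   # fully shuffled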

How to write in GrADS-readable binary format using python?

I have been trying to export data stored as a numpy array to GrADS flat binary.
It seems GrADS does not recognize the Z dimension given in the .ctl file.
Whatever value I pass to 'set z', GrADS only shows the first level.
Here is a minimal reproduction of my problem.
My python code:
import numpy as np
from array import array

data = np.linspace(1, 60, num=60, endpoint=True)
data = np.reshape(data, [5, 4, 3])
print(data)
with open('temp.dat', 'ab') as wf:
    # the with block closes the file, so no explicit close() is needed
    float_array = array('f', data.flatten())
    float_array.tofile(wf)
Executing this writes the numbers [1, 2, 3, ..., 60] as single-precision floats to a binary file.
My .ctl file:
DSET ^temp.dat
TITLE title
UNDEF -9.99E33
XDEF 3 LINEAR 0.0 1
YDEF 4 LINEAR 0.0 1
ZDEF 5 LEVELS 0 1 2 3 4
TDEF 1 LINEAR 0Z10apr1991 12hr
VARS 1
var 0 99 some var
ENDVARS
This pair of .dat and .ctl files shows the first 12 numbers as the first level of the field, as expected.
ga-> open temp.ctl
Scanning description file: temp.ctl
Data file temp.dat is open as file 1
LON set to 0 2
LAT set to 0 3
LEV set to 0 0
Time values set: 1991:4:10:0 1991:4:10:0
E set to 1 1
ga-> set digsize 0.6
digsiz = 0.6
ga-> set lon -1 3
LON set to -1 3
ga-> set lat -1 4
LAT set to -1 4
ga-> set gxout grid
ga-> d var
However, if I try to 'set z 2', it still shows the first level.
Moreover, with var(z=3) and var(z=1) being
var(z=1)=
[[ 1. 2. 3.]
[ 4. 5. 6.]
[ 7. 8. 9.]
[10. 11. 12.]]
var(z=3)=
[[25. 26. 27.]
[28. 29. 30.]
[31. 32. 33.]
[34. 35. 36.]]
the difference below should be a constant field of 24, but GrADS shows 0, as if z=3 were the same level as z=1:
ga-> c
ga-> d var(z=3) - var(z=1)
What is even more puzzling is that if I add a second variable var2 to the .ctl file,
GrADS treats what should be var(z=2) as var2(z=1)!
I know there are lots of visualization tools better than GrADS, but I need to run legacy GrADS code, so it is unavoidable.
Did I write the binary in the wrong order? Or is the binary file missing a header or separator of some kind?
I would be glad if anyone knows why this is happening.
Thanks in advance.
My colleague has just pointed out that the problem is in the .ctl file.
I had set the level count to zero, which GrADS interprets as a "surface" variable:
a single-level field that can be overlaid on any level above it.
I am able to show any level of the field with the corrected .ctl file.
My corrected .ctl:
DSET ^temp.dat
TITLE title
UNDEF -9.99E33
XDEF 3 LINEAR 0.0 1
YDEF 4 LINEAR 0.0 1
ZDEF 5 LEVELS 0 1 2 3 4
TDEF 1 LINEAR 0Z10apr1991 12hr
VARS 1
* the level count should be 5, not 0, because there are five levels!
* var 0 99 some var
var 5 99 some var
ENDVARS
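
As a side note on the byte-order question above: for a single variable and time step, GrADS flat binary expects X to vary fastest, then Y, then Z, which is exactly what flattening a C-ordered array of shape (ZDEF, YDEF, XDEF) produces, so the write order here was fine. A quick sanity check (a sketch, assuming little-endian 4-byte floats and no sequential record markers):

import numpy as np

# read the file back and reshape to (Z, Y, X); level k (0-based)
# should match var(z=k+1) in GrADS
raw = np.fromfile('temp.dat', dtype='<f4').reshape(5, 4, 3)
print(raw[2])  # expect 25..36, i.e. var(z=3)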

Combine two numpy arrays and convert them into a dataframe

I have two DataFrames (X and y) sliced off the main DataFrame df as below:
X = df.ix[:,df.columns!='Class']
y = df.ix[:,df.columns=='Class']
from imblearn.over_sampling import SMOTE
sm = SMOTE()
X_resampled , y_resampled = sm.fit_sample(X,y.values.ravel())
The last line returns numpy arrays for X_resampled and y_resampled.
So I would like to know how to convert X_resampled and y_resampled back into DataFrames.
Example Data :
X_resampled: dimensions (2, 30): 2 rows, 30 columns
array([[ 0. , -1.35980713, -0.07278117, 2.53634674, 1.37815522,
-0.33832077, 0.46238778, 0.23959855, 0.0986979 , 0.36378697,
0.09079417, -0.55159953, -0.61780086, -0.99138985, -0.31116935,
1.46817697, -0.47040053, 0.20797124, 0.02579058, 0.40399296,
0.2514121 , -0.01830678, 0.27783758, -0.11047391, 0.06692807,
0.12853936, -0.18911484, 0.13355838, -0.02105305, 0.24496426],
[ 0. , 1.19185711, 0.26615071, 0.16648011, 0.44815408,
0.06001765, -0.08236081, -0.07880298, 0.08510165, -0.25542513,
-0.16697441, 1.61272666, 1.06523531, 0.48909502, -0.1437723 ,
0.63555809, 0.46391704, -0.11480466, -0.18336127, -0.14578304,
-0.06908314, -0.22577525, -0.63867195, 0.10128802, -0.33984648,
0.1671704 , 0.12589453, -0.0089831 , 0.01472417, -0.34247454]])
y_resampled: dimensions (2,), corresponding to the two rows of X_resampled.
array([0, 0], dtype=int64)
I believe you need numpy.hstack:
import numpy as np
import pandas as pd

a = np.array([[ 0. , -1.35980713, -0.07278117, 2.53634674, 1.37815522,
-0.33832077, 0.46238778, 0.23959855, 0.0986979 , 0.36378697,
0.09079417, -0.55159953, -0.61780086, -0.99138985, -0.31116935,
1.46817697, -0.47040053, 0.20797124, 0.02579058, 0.40399296,
0.2514121 , -0.01830678, 0.27783758, -0.11047391, 0.06692807,
0.12853936, -0.18911484, 0.13355838, -0.02105305, 0.24496426],
[ 0. , 1.19185711, 0.26615071, 0.16648011, 0.44815408,
0.06001765, -0.08236081, -0.07880298, 0.08510165, -0.25542513,
-0.16697441, 1.61272666, 1.06523531, 0.48909502, -0.1437723 ,
0.63555809, 0.46391704, -0.11480466, -0.18336127, -0.14578304,
-0.06908314, -0.22577525, -0.63867195, 0.10128802, -0.33984648,
0.1671704 , 0.12589453, -0.0089831 , 0.01472417, -0.34247454]])
b = np.array([0, 100])
c = pd.DataFrame(np.hstack((a,b[:, None])))
print (c)
0 1 2 3 4 5 6 7 \
0 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599
1 0.0 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803
8 9 ... 21 22 23 24 \
0 0.098698 0.363787 ... -0.018307 0.277838 -0.110474 0.066928
1 0.085102 -0.255425 ... -0.225775 -0.638672 0.101288 -0.339846
25 26 27 28 29 30
0 0.128539 -0.189115 0.133558 -0.021053 0.244964 0.0
1 0.167170 0.125895 -0.008983 0.014724 -0.342475 100.0
[2 rows x 31 columns]
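
If you also want the original column names back, you can rebuild labelled DataFrames directly, since fit_sample returns plain arrays. A sketch, assuming X and y are the DataFrames from the question:

import pandas as pd

# restore the original column names on the resampled data
X_res_df = pd.DataFrame(X_resampled, columns=X.columns)
y_res_df = pd.DataFrame(y_resampled, columns=['Class'])
df_resampled = pd.concat([X_res_df, y_res_df], axis=1)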

How can I generate a rolling metric like this in Pandas

I have a dataframe that initially contains two columns: Home, which is 1 if a game was played at home, else 0, and PTS, which records the number of points a player scored in a given game. I want to end up with a third column, a rolling metric that represents how sensitive a player is to playing at home. I'll calculate this as follows:
Home Sensitivity = (Average PTS Home - Average PTS Away)/Average PTS
I did this successfully in the following code, but it felt cumbersome, as I created many columns I didn't need in the end. How can I solve this problem more directly?
df = pd.DataFrame({'Home': [1, 0, 1, 0, 1, 0, 1, 0], 'PTS': [11, 10, 12, 11, 13, 12, 14, 12]})
df.loc[df['Home'] == 1, 'Home PTS'] = df['PTS']
df.loc[df['Home'] == 0, 'Away PTS'] = df['PTS']
df['Home PTS'] = df['Home PTS'].fillna(0)
df['Away PTS'] = df['Away PTS'].fillna(0)
df['Home Sum'] = df['Home PTS'].expanding(min_periods=1).sum()
df['Away Sum'] = df['Away PTS'].expanding(min_periods=1).sum()
df['Home Count']=df['Home'].expanding().sum()
df['Index']=df.index+1
df['Away Count']=df['Index']-df['Home Count']
df['Home Average']=df['Home Sum']/df['Home Count']
df['Away Average']=df['Away Sum']/df['Away Count']
df['Average']=df['PTS'].expanding().mean()
df['Metric']=(df['Home Average']-df['Away Average'])/df['Average']
Here is a naive way to do it: take increasingly larger slices of the DataFrame in a loop; do the math on each slice and store it in a list; assign the list to a new column of the DataFrame (using your testDF):
df = testDF
sens = []
for i in range(len(df)):
    d = df[:i]
    mean_pts = d.PTS.mean()
    home = d[d.Home == 1].PTS.mean()
    away = d[d.Home == 0].PTS.mean()
    # print(home, away, (home - away) / mean_pts)
    sens.append((home - away) / mean_pts)
df['sens'] = sens
>>> df
Home PTS sens
0 1 11 NaN
1 0 10 NaN
2 1 12 0.095238
3 0 11 0.136364
4 1 13 0.090909
5 0 12 0.131579
6 1 14 0.086957
7 0 12 0.126506
Using DataFrame.expanding(): Not quite there yet ...
>>> mean_pts = df.PTS.expanding(1).mean()
>>> away = df[df['Home'] == 0].PTS.expanding(1).mean()
>>> home = df[df['Home'] == 1].PTS.expanding(1).mean()
>>>
>>> home
0 11.0
2 11.5
4 12.0
6 12.5
Name: PTS, dtype: float64
>>> away
1 10.00
3 10.50
5 11.00
7 11.25
Name: PTS, dtype: float64
>>> mean_pts
0 11.000000
1 10.500000
2 11.000000
3 11.000000
4 11.400000
5 11.500000
6 11.857143
7 11.875000
Name: PTS, dtype: float64
>>>
To do the math will require more manipulation.
You cannot get the difference between home and away directly because the indices are different - but you can do ...
>>> home.values - away.values
array([ 1. , 1. , 1. , 1.25])
>>>
Also, home and away only have four rows each, while mean_pts has eight.
I tried .expanding(1).apply() with the following function and didn't get what I expected: expanding doesn't pass both columns to the function at once; it appears to pass one column, then the other. So I punted...
def f(thing):
    print(thing, '***')
    return thing.mean()
>>> df.expanding(1).apply(f)
[ 1.] ***
[ 1. 0.] ***
[ 1. 0. 1.] ***
[ 1. 0. 1. 0.] ***
[ 1. 0. 1. 0. 1.] ***
[ 1. 0. 1. 0. 1. 0.] ***
[ 1. 0. 1. 0. 1. 0. 1.] ***
[ 1. 0. 1. 0. 1. 0. 1. 0.] ***
[ 11.] ***
[ 11. 10.] ***
[ 11. 10. 12.] ***
[ 11. 10. 12. 11.] ***
[ 11. 10. 12. 11. 13.] ***
[ 11. 10. 12. 11. 13. 12.] ***
[ 11. 10. 12. 11. 13. 12. 14.] ***
[ 11. 10. 12. 11. 13. 12. 14. 12.] ***
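
That said, there is a more direct route that avoids both the loop and the helper columns: mask PTS by Home/Away and let expanding().mean() skip the NaNs. A sketch (note it includes the current game in each window, so the values are shifted by one game relative to the loop above, which sliced df[:i]):

import pandas as pd

df = pd.DataFrame({'Home': [1, 0, 1, 0, 1, 0, 1, 0],
                   'PTS': [11, 10, 12, 11, 13, 12, 14, 12]})

# PTS where the game was at home (NaN otherwise), and vice versa;
# expanding().mean() ignores NaNs, giving expanding home/away averages
home_avg = df['PTS'].where(df['Home'] == 1).expanding().mean()
away_avg = df['PTS'].where(df['Home'] == 0).expanding().mean()
df['Metric'] = (home_avg - away_avg) / df['PTS'].expanding().mean()
print(df)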

How to select a group of minimum values out of a numpy array?

fitVec = np.zeros((100, 2))  # initialize fitVec: first column will hold the indices, second column the values
After initialization, fitVec gets assigned values by running a function.
Final fitVec values:
fitVec [[ 2.00000000e+01 2.42733444e+10]
[ 2.10000000e+01 2.53836270e+10]
[ 2.20000000e+01 2.65580909e+10]
[ 2.30000000e+01 2.76674886e+10]
[ 2.40000000e+01 2.88334239e+10]
[ 2.50000000e+01 3.00078878e+10]
[ 2.60000000e+01 3.11823517e+10]
[ 2.70000000e+01 3.22917494e+10]
[ 2.80000000e+01 3.34011471e+10]
[ 2.90000000e+01 3.45756109e+10]
[ 3.00000000e+01 3.57500745e+10]
[ 3.10000000e+01 3.68594722e+10]
[ 3.20000000e+01 3.79688699e+10]
[ 3.30000000e+01 3.90782676e+10]
[ 3.40000000e+01 4.02527315e+10]
[ 3.50000000e+01 4.14271953e+10]
[ 3.60000000e+01 4.25365930e+10]
[ 3.70000000e+01 4.36476395e+10]]
I haven't shown all of the 100×2 matrix to make it look less messy.
Now I want to select the 20 rows with the smallest values (a 20×2 result) out of it.
I'm trying
winner = np.argmin(fitVec[:, 1])
but it gives me only one minimum value, whereas I want the 20 smallest. How should I go about it?
First off, I'd separate indices and values; no need to store them both as float. After that, numpy.argsort is your friend:
import numpy
idx = numpy.arange(20, 38, dtype=int)
vals = numpy.random.rand(len(idx))
i = numpy.argsort(vals)
sorted_idx = idx[i]
sorted_vals = vals[i]
print(idx)
print(vals)
print()
print(sorted_idx)
print(sorted_vals)
Output:
[20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37]
[ 0.00560689 0.73380138 0.53490514 0.1221538 0.45490855 0.39076217
0.39906252 0.59933451 0.7163099 0.393409 0.15854323 0.4631854
0.92469362 0.69999709 0.67664291 0.73184184 0.52893679 0.60365631]
[20 23 30 25 29 26 24 31 36 22 27 37 34 33 28 35 21 32]
[ 0.00560689 0.1221538 0.15854323 0.39076217 0.393409 0.39906252
0.45490855 0.4631854 0.52893679 0.53490514 0.59933451 0.60365631
0.67664291 0.69999709 0.7163099 0.73184184 0.73380138 0.92469362]
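
If you only need the 20 smallest rows rather than a full ordering, numpy.argpartition avoids sorting the whole array. A sketch on a made-up fitVec:

import numpy as np

fitVec = np.column_stack([np.arange(100.0), np.random.rand(100)])
k = 20

# indices of the k smallest values, in arbitrary order, found in O(n)
part = np.argpartition(fitVec[:, 1], k)[:k]
smallest = fitVec[part]

# then sort just those k rows by value if the order matters
smallest = smallest[np.argsort(smallest[:, 1])]
print(smallest)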
