Combine two numpy arrays and covert them into a dataframe

Combine two numpy arrays and covert them into a dataframe - python

I have two Dataframes (X & y) sliced off the main dataframe df as below :
X = df.ix[:,df.columns!='Class']
y = df.ix[:,df.columns=='Class']
from imblearn.over_sampling import SMOTE
sm = SMOTE()
X_resampled , y_resampled = sm.fit_sample(X,y.values.ravel())
The last line returns a numpy 2-d array for X_resampled and y_resampled.
So I would want to know how to convert X_resampled and y_resampled back into a dataframe.
Example Data :
X_resampled :Dimensions(2,30) : 2 rows,30 columns
array([[ 0. , -1.35980713, -0.07278117, 2.53634674, 1.37815522,
-0.33832077, 0.46238778, 0.23959855, 0.0986979 , 0.36378697,
0.09079417, -0.55159953, -0.61780086, -0.99138985, -0.31116935,
1.46817697, -0.47040053, 0.20797124, 0.02579058, 0.40399296,
0.2514121 , -0.01830678, 0.27783758, -0.11047391, 0.06692807,
0.12853936, -0.18911484, 0.13355838, -0.02105305, 0.24496426],
[ 0. , 1.19185711, 0.26615071, 0.16648011, 0.44815408,
0.06001765, -0.08236081, -0.07880298, 0.08510165, -0.25542513,
-0.16697441, 1.61272666, 1.06523531, 0.48909502, -0.1437723 ,
0.63555809, 0.46391704, -0.11480466, -0.18336127, -0.14578304,
-0.06908314, -0.22577525, -0.63867195, 0.10128802, -0.33984648,
0.1671704 , 0.12589453, -0.0089831 , 0.01472417, -0.34247454]])
y_resampled :Dimensions (2,) - Coressponding to the two rows of X_resampled.
array([0, 0], dtype=int64)

I believe you need numpy.hstack:
a = np. array([[ 0. , -1.35980713, -0.07278117, 2.53634674, 1.37815522,
-0.33832077, 0.46238778, 0.23959855, 0.0986979 , 0.36378697,
0.09079417, -0.55159953, -0.61780086, -0.99138985, -0.31116935,
1.46817697, -0.47040053, 0.20797124, 0.02579058, 0.40399296,
0.2514121 , -0.01830678, 0.27783758, -0.11047391, 0.06692807,
0.12853936, -0.18911484, 0.13355838, -0.02105305, 0.24496426],
[ 0. , 1.19185711, 0.26615071, 0.16648011, 0.44815408,
0.06001765, -0.08236081, -0.07880298, 0.08510165, -0.25542513,
-0.16697441, 1.61272666, 1.06523531, 0.48909502, -0.1437723 ,
0.63555809, 0.46391704, -0.11480466, -0.18336127, -0.14578304,
-0.06908314, -0.22577525, -0.63867195, 0.10128802, -0.33984648,
0.1671704 , 0.12589453, -0.0089831 , 0.01472417, -0.34247454]])
b = np.array([0, 100])
c = pd.DataFrame(np.hstack((a,b[:, None])))
print (c)
0 1 2 3 4 5 6 7 \
0 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599
1 0.0 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803
8 9 ... 21 22 23 24 \
0 0.098698 0.363787 ... -0.018307 0.277838 -0.110474 0.066928
1 0.085102 -0.255425 ... -0.225775 -0.638672 0.101288 -0.339846
25 26 27 28 29 30
0 0.128539 -0.189115 0.133558 -0.021053 0.244964 0.0
1 0.167170 0.125895 -0.008983 0.014724 -0.342475 100.0
[2 rows x 31 columns]

Related

How do I mask only the output (labelled data). I don't have any problem in input data

I have so many Nan values in my output data and I padded those values with zeros. Please don't suggest me to delete Nan or impute with any other no. I want model to skip those nan positions.
example:
x = np.arange(0.5, 30)
x.shape = [10, 3]
x = [[ 0.5 1.5 2.5]
[ 3.5 4.5 5.5]
[ 6.5 7.5 8.5]
[ 9.5 10.5 11.5]
[12.5 13.5 14.5]
[15.5 16.5 17.5]
[18.5 19.5 20.5]
[21.5 22.5 23.5]
[24.5 25.5 26.5]
[27.5 28.5 29.5]]
y = np.arange(2, 10, 0.8)
y.shape = [10, 1]
y[4, 0] = 0.0
y[6, 0] = 0.0
y[7, 0] = 0.0
y = [[2. ]
[2.8]
[3.6]
[4.4]
[0. ]
[6. ]
[0. ]
[0. ]
[8.4]
[9.2]]
I expect keras deep learning model to predict zeros for 5th, 7th and 8th row as similar to the padded value in 'y'.

Numpy Finding Matching number with Array

Any help is greatly appreciated!! I have been trying to solve this for the last few days....
I have two arrays:
import pandas as pd
OldDataSet = {
'id': [20,30,40,50,60,70]
,'OdoLength': [26.12,43.12,46.81,56.23,111.07,166.38]}
NewDataSet = {
'id': [3000,4000,5000,6000,7000,8000]
,'OdoLength': [25.03,42.12,45.74,46,110.05,165.41]}
df1= pd.DataFrame(OldDataSet)
df2 = pd.DataFrame(NewDataSet)
OldDataSetArray = df1.as_matrix()
NewDataSetArray = df2.as_matrix()
The result that I am trying to get is:
Array 1 and Array 2 Match by closes difference, based on left over number from Array2
20 26.12 3000 25.03
30 43.12 4000 42.12
40 46.81 6000 46
50 56.23 7000 110.05
60 111.07 8000 165.41
70 166.38 0 0
Starting at Array 1, ID 20, find the nearest which in this case would be the first Number in Array 2 ID 3000 (26.12-25.03). so ID 20, gets matched to 3000.
Where it gets tricky is if one value in Array 2 is not the closest, then it is skipped. for example, ID 40 value 46.81 is compared to 45.74, 46 and the smallest value is .81 from 46 ID 6000. So ID 40--> ID 6000. ID 5000 in array 2 is now skipped for any future comparisons. So now when comparing array 1 ID 50, it is compared to the next available number in array 2, 110.05. array 1 ID 50 is matched to Array 2 ID 7000.
UPDATE
so here's the code that i have tried and it works. Yes, it is not the greatest, so if someone has another suggestion please let me know.
import pandas as pd
import operator
OldDataSet = {
'id': [20,30,40,50,60,70]
,'OdoLength': [26.12,43.12,46.81,56.23,111.07,166.38]}
NewDataSet = {
'id': [3000,4000,5000,6000,7000,8000]
,'OdoLength': [25.03,42.12,45.74,46,110.05,165.41]}
df1= pd.DataFrame(OldDataSet)
df2 = pd.DataFrame(NewDataSet)
OldDataSetArray = df1.as_matrix()
NewDataSetArray = df2.as_matrix()
newPos = 1
CurrentNumber = 0
OldArrayLen = len(OldDataSetArray) -1
NewArrayLen = len(NewDataSetArray) -1
numberResults = []
for oldPos in range(len(OldDataSetArray)):
PreviousNumber = abs(OldDataSetArray[oldPos, 0]- NewDataSetArray[oldPos, 0])
while newPos <= len(NewDataSetArray) - 1:
CurrentNumber = abs(OldDataSetArray[oldPos, 0] - NewDataSetArray[newPos, 0])
#if it is the last row for the inner array, then match the next available
#in Array 1 to that last record
if newPos == NewArrayLen and oldPos < newPos and oldPos +1 <= OldArrayLen:
numberResults.append([OldDataSetArray[oldPos +1, 1],NewDataSetArray[newPos, 1],OldDataSetArray[oldPos +1, 0],NewDataSetArray[newPos, 0]])
if PreviousNumber < CurrentNumber:
numberResults.append([OldDataSetArray[oldPos, 1], NewDataSetArray[newPos - 1, 1], OldDataSetArray[oldPos, 0], NewDataSetArray[newPos - 1, 0]])
newPos +=1
break
elif PreviousNumber > CurrentNumber:
PreviousNumber = CurrentNumber
newPos +=1
#sort by array one values
numberResults = sorted(numberResults, key=operator.itemgetter(0))
numberResultsDf = pd.DataFrame(numberResults)

You can use NumPy broadcasting to build a distance matrix:
a = numpy.array([26.12, 43.12, 46.81, 56.23, 111.07, 166.38,])
b = numpy.array([25.03, 42.12, 45.74, 46, 110.05, 165.41,])
numpy.abs(a[:, None] - b[None, :])
# array([[ 1.09, 16. , 19.62, 19.88, 83.93, 139.29],
# [ 18.09, 1. , 2.62, 2.88, 66.93, 122.29],
# [ 21.78, 4.69, 1.07, 0.81, 63.24, 118.6 ],
# [ 31.2 , 14.11, 10.49, 10.23, 53.82, 109.18],
# [ 86.04, 68.95, 65.33, 65.07, 1.02, 54.34],
# [ 141.35, 124.26, 120.64, 120.38, 56.33, 0.97]])
of that matrix you can then find the closest elements using argmin, either row- or columnwise (depending of if you want to search in a or b).
numpy.argmin(numpy.abs(a[:, None] - b[None, :]), axis=1)
# array([0, 1, 3, 3, 4, 5])

Compute all the differences, and use `np.argmin to lookup the closest.
a,b=np.random.rand(2,10)
all_differences=np.abs(np.subtract.outer(a,b))
ia=all_differences.argmin(axis=1)
for i in range(10):
print(i,a[i],ia[i], b[ia[i]])
0 0.231603891949 8 0.21177584152
1 0.27810475456 7 0.302647382888
2 0.582133214953 2 0.548920922033
3 0.892858042793 1 0.872622982632
4 0.67293347218 6 0.677971552011
5 0.985227546492 1 0.872622982632
6 0.82431697833 5 0.83765895237
7 0.426992114791 4 0.451084369838
8 0.181147161752 8 0.21177584152
9 0.631139744522 3 0.653554586691
EDIT
with dataframes and indexes:
va,vb=np.random.rand(2,10)
na,nb=np.random.randint(0,100,(2,10))
dfa=pd.DataFrame({'id':na,'odo':va})
dfb=pd.DataFrame({'id':nb,'odo':vb})
all_differences=np.abs(np.subtract.outer(dfa.odo,dfb.odo))
ia=all_differences.argmin(axis=1)
dfc=dfa.merge(dfb.loc[ia].reset_index(drop=True),\
left_index=True,right_index=True)
Input :
In [337]: dfa
Out[337]:
id odo
0 72 0.426457
1 12 0.315997
2 96 0.623164
3 9 0.821498
4 72 0.071237
5 5 0.730634
6 45 0.963051
7 14 0.603289
8 5 0.401737
9 63 0.976644
In [338]: dfb
Out[338]:
id odo
0 95 0.333215
1 7 0.023957
2 61 0.021944
3 57 0.660894
4 22 0.666716
5 6 0.234920
6 83 0.642148
7 64 0.509589
8 98 0.660273
9 19 0.658639
Output :
In [339]: dfc
Out[339]:
id_x odo_x id_y odo_y
0 72 0.426457 64 0.509589
1 12 0.315997 95 0.333215
2 96 0.623164 83 0.642148
3 9 0.821498 22 0.666716
4 72 0.071237 7 0.023957
5 5 0.730634 22 0.666716
6 45 0.963051 22 0.666716
7 14 0.603289 83 0.642148
8 5 0.401737 95 0.333215
9 63 0.976644 22 0.666716

How can I generate a rolling metric like this in Pandas

I have a dataframe that initially contains two columns, Home, which is 1 if a game was player at home, else 0, and PTS, which records the number of points a player scored in a given game. I want to end up with a third column, a rolling metric that represents how sensitive a player is to playing at home. I'll calculate this as follows:
Home Sensitivity = (Average PTS Home - Average PTS Away)/Average PTS
I did this successfully in the following code, but it felt cumbersome, as I created many columns I didn't need in the end. How can I solve this problem more directly?
df=pd.DataFrame({'Home':[1,0,1,0,1,0,1,0], 'PTS':[11, 10, 12, 11, 13, 12, 14, 12]})
df.loc[testDF['Home'] == 1, 'Home PTS'] = df['PTS']
df.loc[testDF['Home'] == 0, 'Away PTS'] = df['PTS']
df['Home PTS'] = df['Home PTS'].fillna(0)
df['Away PTS'] = df['Away PTS'].fillna(0)
df['Home Sum'] = df['Home PTS'].expanding(min_periods=1).sum()
df['Away Sum'] = df['Away PTS'].expanding(min_periods=1).sum()
df['Home Count']=df['Home'].expanding().sum()
df['Index']=df.index+1
df['Away Count']=df['Index']-df['Home Count']
df['Home Average']=df['Home Sum']/df['Home Count']
df['Away Average']=df['Away Sum']/df['Away Count']
df['Average']=df['PTS'].expanding().mean()
df['Metric']=(df['Home Average']-df['Away Average'])/df['Average']

Here is a naive way to do it: take increasingly larger slices of the DataFrame in a loop; do the math on each slice and store it in a list; assign the list to a new column of the DataFrame (using your testDF):
df = tesdDF
sens = []
for i in range(len(df)):
d = df[:i]
mean_pts = d.PTS.mean()
home = d[d.Home == 1].PTS.mean()
away = d[d.Home == 0].PTS.mean()
#print(home, away, (home - away) / mean_pts)
sens.append((home - away) / mean_pts)
df['sens'] = sens
>>> df
Home PTS sens
0 1 11 NaN
1 0 10 NaN
2 1 12 0.095238
3 0 11 0.136364
4 1 13 0.090909
5 0 12 0.131579
6 1 14 0.086957
7 0 12 0.126506
Using DataFrame.expanding(): Not quite there yet ...
>>> mean_pts = df.PTS.expanding(1).mean()
>>> away = df[df['Home'] == 0].PTS.expanding(1).mean()
>>> home = df[df['Home'] == 1].PTS.expanding(1).mean()
>>>
>>> home
0 11.0
2 11.5
4 12.0
6 12.5
Name: PTS, dtype: float64
>>> away
1 10.00
3 10.50
5 11.00
7 11.25
Name: PTS, dtype: float64
>>> mean_pts
0 11.000000
1 10.500000
2 11.000000
3 11.000000
4 11.400000
5 11.500000
6 11.857143
7 11.875000
Name: PTS, dtype: float64
>>>
To do the math will require more manipulation.
You cannot get the difference between home and away directly because the indices are different - but you can do ...
>>> home.values - away.values
array([ 1. , 1. , 1. , 1.25])
>>>
Also home and away only have four rows and mean_pts has eight.
I tried .expanding(1).apply() with the following function and didn't get what I expected, expanding doesn't pass both columns to the function, it appears to pass one column then the other; so I punted...
def f(thing):
print(thing, '***')
return thing.mean()
>>> df.expanding(1).apply(f)
[ 1.] ***
[ 1. 0.] ***
[ 1. 0. 1.] ***
[ 1. 0. 1. 0.] ***
[ 1. 0. 1. 0. 1.] ***
[ 1. 0. 1. 0. 1. 0.] ***
[ 1. 0. 1. 0. 1. 0. 1.] ***
[ 1. 0. 1. 0. 1. 0. 1. 0.] ***
[ 11.] ***
[ 11. 10.] ***
[ 11. 10. 12.] ***
[ 11. 10. 12. 11.] ***
[ 11. 10. 12. 11. 13.] ***
[ 11. 10. 12. 11. 13. 12.] ***
[ 11. 10. 12. 11. 13. 12. 14.] ***
[ 11. 10. 12. 11. 13. 12. 14. 12.] ***

numpy array converted to pandas dataframe drops values

I need to calculate statistics for each node of a 2D grid. I figured the easy way to do this was to take the cross join (AKA cartesian product) of two ranges. I implemented this using numpy as this function:
def node_grid(x_range, y_range, x_increment, y_increment):
x_min = float(x_range[0])
x_max = float(x_range[1])
x_num = (x_max - x_min)/x_increment + 1
y_min = float(y_range[0])
y_max = float(y_range[1])
y_num = (y_max - y_min)/y_increment + 1
x = np.linspace(x_min, x_max, x_num)
y = np.linspace(y_min, y_max, y_num)
ng = list(product(x, y))
ng = np.array(ng)
return ng, x, y
However when I convert this to a pandas dataframe it drops values. For example:
In [2]: ng = node_grid(x_range=(-60, 120), y_range=(0, 40), x_increment=0.1, y_increment=0.1)
In [3]: ng[0][(ng[0][:,0] > -31) & (ng[0][:,0] < -30) & (ng[0][:,1]==10)]
Out[3]: array([[-30.9, 10. ],
[-30.8, 10. ],
[-30.7, 10. ],
[-30.6, 10. ],
[-30.5, 10. ],
[-30.4, 10. ],
[-30.3, 10. ],
[-30.2, 10. ],
[-30.1, 10. ]])
In [4]: node_df = pd.DataFrame(ng[0])
node_df.columns = ['xx','depth']
print(node_df[(node_df.depth==10) & node_df.xx.between(-30,-31)])
Out[4]:Empty DataFrame
Columns: [xx, depth]
Index: []
The dataframe isn't empty:
In [5]: print(node_df.head())
Out[5]: xx depth
0 -60.0 0.0
1 -60.0 0.1
2 -60.0 0.2
3 -60.0 0.3
4 -60.0 0.4
values from the numpy array are being dropped when they are being put into the pandas array. Why?

the "between" function demands that the first argument be less than the latter.
In: print(node_df[(node_df.depth==10) & node_df.xx.between(-31,-30)])
xx depth
116390 -31.0 10.0
116791 -30.9 10.0
117192 -30.8 10.0
117593 -30.7 10.0
117994 -30.6 10.0
118395 -30.5 10.0
118796 -30.4 10.0
119197 -30.3 10.0
119598 -30.2 10.0
119999 -30.1 10.0
120400 -30.0 10.0
For clarity the product() function used comes from the itertools package, i.e., from itertools import product

I can't fully reproduce your code.
But I find the problem is that you have to turn the lower and upper boundaries around in the between query. The following works for me:
print(node_df[(node_df.depth==10) & node_df.xx.between(-31,-30)])
when using:
ng = np.array([[-30.9, 10. ],
[-30.8, 10. ],
[-30.7, 10. ],
[-30.6, 10. ],
[-30.5, 10. ],
[-30.4, 10. ],
[-30.3, 10. ],
[-30.2, 10. ],
[-30.1, 10. ]])
node_df = pd.DataFrame(ng)

Python: Sending discontinuous data with mpi4py

I have a C-ordered matrix of dimensions (N,M)
mat = np.random.randn(N, M)
of which I want to send a column through a persistent MPI request to another node. However, using mpi4py,
sreq = MPI.COMM_WORLD.Send_Init((mat[:,idx], MPI.DOUBLE), send_id, tag)
fails on account of the slice being non-contiguous. Can someone suggest a way of going about this? I believe in C that MPI_Type_vector allows for one to specify a stride when creating a type. How can I accomplish this with mpi4py?

create a sendbuffer!
look at this example:
1 #!/usr/bin/python2
2 # -*- coding: utf-8 -*-
3
4 from mpi4py import MPI
5 import numpy as np
6
7 comm = MPI.COMM_WORLD
8 rank = comm.Get_rank()
9
10 matrix = np.empty((5, 10), dtype='f')
11 for y in xrange(len(matrix)):
12 for x in xrange(len(matrix[0])):
13 matrix[y,x] = rank * 10 + x * y
14
15 sendbuf = np.empty(5, dtype='f')
16
17 #column 1
18 sendbuf[:] = matrix[:,1]
19
20 result = comm.gather(sendbuf, root=0)
21
22 if rank == 0:
23 for res in result:
24 print res
this will give you:
$ mpirun -np 4 column.py
[ 0. 1. 2. 3. 4.]
[ 10. 11. 12. 13. 14.]
[ 20. 21. 22. 23. 24.]
[ 30. 31. 32. 33. 34.]

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Combine two numpy arrays and covert them into a dataframe - python

Related

How do I mask only the output (labelled data). I don't have any problem in input data

Numpy Finding Matching number with Array

How can I generate a rolling metric like this in Pandas

numpy array converted to pandas dataframe drops values

Python: Sending discontinuous data with mpi4py

Categories

Resources