Numpy groupby for target encoding (aka mean encoding) - python

I am trying to do a target encoding of the categorical columns of a feature array X based on a 0-1 target array y, i.e. substitute each level of feature x_i with the mean value of the target (i.e. the share of 1's) for that level.
The following code is likely to be inefficient because of the two nested loops that mimic the group-by. Is there any room for improvement in such an implementation (avoiding the slow pandas group-by)? Thank you
import numpy as np

np.random.seed(9)
rows, cols = 10_000, 500
X = np.random.choice(['a', 'b', 'c', 'd', 'e', 'f', 'g'], size=(rows, cols))
y = np.random.choice([0, 1], size=(rows, 1))

# learn encoding
maps_num = {}
for colum in range(X.shape[1]):
    c = X[:, colum]
    if c.dtype.kind == "U":
        unique = np.unique(c)
        tmap_num = {}
        for uni in unique:
            tmap_num[uni] = y[c == uni].mean()
        maps_num[str(colum)] = tmap_num

# apply encoding
X = X.astype('<U32')  # widen the string dtype so the float means fit
for col, tmap in maps_num.items():
    vals = np.full(X.shape[0], np.nan)
    for val, mean_target in tmap.items():
        vals[X[:, int(col)] == val] = mean_target
    X[:, int(col)] = vals  # note: means end up stored as strings
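A vectorized sketch (my suggestion, not from the original post): np.unique with return_inverse=True assigns each level an integer code, and np.bincount can then sum and count the targets per code in one pass, so the inner loop over levels and the per-level boolean masks disappear (only the loop over columns remains). Using the original X and y from above, before the in-place encoding:
import numpy as np

def target_encode(X, y):
    # one unique/bincount pass per column; no loop over unique levels
    out = np.empty(X.shape, dtype=float)
    yr = y.ravel().astype(float)
    for j in range(X.shape[1]):
        uniques, inverse = np.unique(X[:, j], return_inverse=True)
        sums = np.bincount(inverse, weights=yr)   # target sum per level
        counts = np.bincount(inverse)             # rows per level
        out[:, j] = (sums / counts)[inverse]      # broadcast means back
    return out

X_enc = target_encode(X, y)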

Related

Successive calculation of coefficient of variation

I have a long list of values which is triple-indexed (i, j, t). For all i in I and j in J I have to extract all t values and calculate the coefficient of variation (cv) successively, so the cv list has length len(I)*len(J). Then I plot the cv list and check whether the cv converged to some number.
Right now I am looping, which is rather inefficient (see example). Is there another possibility that avoids the loops?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

iN = 10
jN = 10
tN = 20
I = range(iN)
J = range(jN)
T = range(tN)

idx = pd.MultiIndex.from_product([I, J, T])
data = np.random.normal(loc=1, size=iN * jN * tN)
df = pd.DataFrame(data, index=idx, columns=['value'])

values_lst = []
cv_lst = []
for i in I:
    for j in J:
        values_lst.extend(df.loc[(i, j, slice(None)), 'value'])
        sd = np.std(values_lst, ddof=1)
        mean = np.mean(values_lst)
        cv_lst.append(sd / mean)

plt.plot(cv_lst)
plt.show()
I'm posting this on the assumption that you can easily extract your data into an (i, j, t)-shaped numpy array. In that case you can let numpy do its magic:
cv_list = np.std(data, axis=2, ddof=1)/np.mean(data, axis=2)
axis=2 means you take your mean or std over the t axis. The result is still a 2D array, which you can reshape or flatten as you like.
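For the setup in the question, the flat DataFrame can be reshaped into that 3D array directly; a minimal sketch, assuming the MultiIndex was built with from_product so the values are already ordered by (i, j, t):
data3d = df['value'].values.reshape(iN, jN, tN)
cv = np.std(data3d, axis=2, ddof=1) / np.mean(data3d, axis=2)
Note that this yields one cv per (i, j) group, whereas the question's loop takes the cv of an ever-growing list (a cumulative quantity); the per-group version is what the one-liner above computes.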

Dataframe multiplication with multiple index

Data: Here
Question:
I have several data sheets which I export to Python as dataframes. I want to perform multiplications across these dataframes, generating another dataframe that either keeps the same dimensions as the input dataframes or augments the dimension (i.e. the index) based on the combination of the different dataframes used. However, I have stumbled upon some issues to which I could not find a solution. Below is the code.
Code:
#---------------------------------------------------------------------------------------------------
#Load the pandas library
#---------------------------------------------------------------------------------------------------
import numpy as np
import pandas as pd
#---------------------------------------------------------------------------------------------------
#Load the dataframes
#---------------------------------------------------------------------------------------------------
##Supply at the gridcell level (in Pj per year)
biosup = pd.read_excel('01EconMod_EU1.xlsx', sheet_name = 'biosup', skiprows = 5, index_col = 0, usecols = 'A:K')
##Cost at the gridcell level (in MEUR per Pj)
biocost = pd.read_excel('01EconMod_EU1.xlsx', sheet_name = 'biocost', skiprows = 5, index_col = 0, usecols = 'A:K')
##Demand at the gridcell level (in Pj per year)
biodem = pd.read_excel('01EconMod_EU1.xlsx', sheet_name = 'biodem', skiprows = 5, index_col = [0,1], usecols = 'A:L')
##Inter-gridcell distance matrix (in km)
dist = pd.read_excel('01EconMod_EU1.xlsx', sheet_name = 'distance', skiprows = 5, index_col = 0, usecols = 'A:AE')
#---------------------------------------------------------------------------------------------------
#Definition of model parameter
#---------------------------------------------------------------------------------------------------
##Power parameter for the distance-decay component (gamma)
gamma = pd.DataFrame({'sim1':[1.06],'sim2':[1.59],'sim3':[2.12]})
gamma = gamma.transpose()
gamma.columns = ['val']
##Inter-gridcell distance range for the supply curve determination (dmaxsup in km)
dmaxsup = pd.DataFrame({'dsup1':[390],'dsup2':[770],'dsup3':[1050]})
dmaxsup = dmaxsup.transpose()
dmaxsup.columns = ['dmax']
##Inter-gridcell distance range for the distance-decay (dmaxdem in km)
dmaxdem = pd.DataFrame({'ddem1':[750],'ddem2':[1000]})
dmaxdem = dmaxdem.transpose()
dmaxdem.columns = ['dmax']
#---------------------------------------------------------------------------------------------------
#New parameter calculation
#---------------------------------------------------------------------------------------------------
##The ratio of the inter-gridcell distance and the dmaxdem
dist1 = pd.DataFrame(np.concatenate(dist.values / dmaxdem.values[:, None]), pd.MultiIndex.from_product([dmaxdem.index, dist.index]), dist.columns)
##The decay coefficients
decay = pd.DataFrame(np.concatenate(2 * (1 / (1 + (np.exp(dist1.values)**gamma.values[:, None])))), pd.MultiIndex.from_product([gamma.index, dist1.index]), dist1.columns)
decay1 = pd.DataFrame(np.concatenate(2 * (1 / (1 + (np.exp(dist.values / dmaxdem.values[:, None])**gamma.values[:, None])))), pd.MultiIndex.from_product([dmaxdem.index, gamma.index, dist.index]), dist.columns)
Comments on the code:
1/ The parameter "dist1" represents the division of the "dist" dataframe by each of the elements of the "dmaxdem" dataframe. Think of the values of the "dmaxdem" dataframe as distance scenarios. In other words, this operation computes the ratio for each of the distance values provided.
2/ I try to compute the distance-decay coefficients, i.e. the "decay" dataframe, as defined by the formula inside the brackets. However, I get the following error message
NotImplementedError: isna is not defined for MultiIndex
which I believe has something to do with the multiindex structure of the "dist1" dataframe. I have also tried a direct approach that embeds the previous operation and requires the use of the 3 different dataframes, as illustrated by the code for "decay1". I get the following error
ValueError: operands could not be broadcast together with shapes (2,30,30) (3,1,1)
Any help would be appreciated.
Pardon me if I misunderstood you, because I am unable to comment before posting an answer:
Well, if they are all the same length and have the same index, you can start off by concatenating them along axis 0. This will create a larger dataframe. Next, you can add the calculated column or columns that you need:
largerdf = pd.concat([df1, df2, df3, dfn], axis=0)
largerdf["calculationcolumn"] = largerdf["columnvalue1"] * largerdf["columnvalue2"]
Or change the operator to whatever you need.
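Separately, a sketch of how the broadcasting error in "decay1" could be addressed (my reading of the intent, not part of the original answer): the two scenario axes, gamma with 3 values and dmaxdem with 2, each need their own leading dimension before NumPy can combine them with the 2D distance matrix. Assuming dist is the 30x30 matrix from the question:
g = gamma['val'].values[:, None, None, None]        # shape (3, 1, 1, 1)
dmax = dmaxdem['dmax'].values[None, :, None, None]  # shape (1, 2, 1, 1)
d = dist.values[None, None, :, :]                   # shape (1, 1, 30, 30)

decay_arr = 2 / (1 + np.exp(d / dmax) ** g)         # shape (3, 2, 30, 30)
decay1 = pd.DataFrame(
    decay_arr.reshape(-1, len(dist.columns)),
    index=pd.MultiIndex.from_product([gamma.index, dmaxdem.index, dist.index]),
    columns=dist.columns,
)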

Is there a way to get dot product of two huge matrix using pandas or pyspark

I'm doing collaborative filtering, and in the predict phase I need the matrix product of two big matrices (4M x 7 and 25k x 7) for the SVD predictions. Is there an efficient and fast way to do so, maybe using pandas or pyspark?
Right now I have come up with a solution that computes the dot product row by row, but that is time-consuming:
for i in range(products):
    user_ratings = np.dot(X_products[i], X_user)
    m = np.min(user_ratings)
    items[:, -1] = i
    ratings[:, -1] = user_ratings
    reorder_cols = np.fliplr(np.argsort(ratings, axis=1))
    rows = np.arange(num_users)[:, np.newaxis]
    # reorder
    ratings = ratings[rows, reorder_cols]
    items = items[rows, reorder_cols]
Any suggestions will be appreciated
I would suggest using pyspark's mllib.linalg.distributed module. Suppose your big matrices are M1 & M2 and you have converted them into RDDs.
1. Convert them into BlockMatrices.
bm_M1 = IndexedRowMatrix(M1.zipWithIndex().map(lambda x:
        (x[1], Vectors.dense(x[0])))).toBlockMatrix(10, 10)
bm_M2 = IndexedRowMatrix(M2.zipWithIndex().map(lambda x:
        (x[1], Vectors.dense(x[0])))).toBlockMatrix(10, 10)
2. Transpose bm_M2 and multiply.
bm_M1.multiply(bm_M2.transpose())
An example:
import numpy as np
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import *

mat = sc.parallelize(np.random.rand(4, 4))
bm_M1 = IndexedRowMatrix(mat.zipWithIndex().map(lambda x:
        (x[1], Vectors.dense(x[0])))).toBlockMatrix(1, 1)
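To round the example off, a sketch assuming a second 4x4 RDD mat2 built the same way (mat2 is my placeholder name):
mat2 = sc.parallelize(np.random.rand(4, 4))
bm_M2 = IndexedRowMatrix(mat2.zipWithIndex().map(lambda x:
        (x[1], Vectors.dense(x[0])))).toBlockMatrix(1, 1)

product = bm_M1.multiply(bm_M2.transpose())  # distributed BlockMatrix product
local = product.toLocalMatrix()              # collect to the driver as a dense matrix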

combining/merging multiple 2d arrays into single array by using python

I have four 2-dimensional np arrays. The shape of each array is (203, 135). Now I want to join all these arrays into one single array with respect to latitude and longitude.
I have used the code below to read the data:
import pandas as pd
import numpy as np
import os
import glob
from pyhdf import SD
import datetime
import mpl_toolkits.basemap.pyproj as pyproj

DATA = ({})
files = glob.glob('MOD04*')
files.sort()
for n, f in enumerate(files):
    SDS_NAME = 'Deep_Blue_Aerosol_Optical_Depth_550_Land'
    hdf = SD.SD(f)
    lat = hdf.select('Latitude')
    latitude = lat[:]
    min_lat = latitude.min()
    max_lat = latitude.max()
    lon = hdf.select('Longitude')
    longitude = lon[:]
    min_lon = longitude.min()
    max_lon = longitude.max()
    sds = hdf.select(SDS_NAME)
    data = sds.get()
    p = pyproj.Proj(proj='utm', zone=45, ellps='WGS84')
    x, y = p(longitude, latitude)

    def set_element(elements, x, y, data):
        # Set element with two coordinates.
        elements[x + (y * 10)] = data

    elements = []
    set_element(elements, x, y, data)
But I got the error: only integer arrays with one element can be converted to an index
you can find the data: https://drive.google.com/open?id=0B2rkXkOkG7ExMElPRDd5YkNEeDQ
I have created toy datasets for this problem, as requested.
What I want is to get one single array from the four arrays (a, b, c, d), whose dimensions should be something like (406, 270):
a = (np.random.rand(27405)).reshape(203,135)
b = (np.random.rand(27405)).reshape(203,135)
c = (np.random.rand(27405)).reshape(203,135)
d = (np.random.rand(27405)).reshape(203,135)
a_x = (np.random.uniform(10,145,27405)).reshape(203,135)
a_y = (np.random.uniform(204,407,27405)).reshape(203,135)
d_x = (np.random.uniform(150,280,27405)).reshape(203,135)
d_y = (np.random.uniform(204,407,27405)).reshape(203,135)
b_x = (np.random.uniform(150,280,27405)).reshape(203,135)
b_y = (np.random.uniform(0,202,27405)).reshape(203,135)
c_x = (np.random.uniform(10,145,27405)).reshape(203,135)
c_y = (np.random.uniform(0,202,27405)).reshape(203,135)
Any help?
This should be a comment, yet the comment space is not enough for these questions. Therefore I am posting here:
You say that you have 4 input arrays (a, b, c, d) which are somehow to be integrated into an output array. As far as I understand, two of these arrays contain positional information (x, y), such as longitude and latitude. The only line in your code where you combine several input arrays is here:
def set_element(elements, x, y, data):
    # Set element with two coordinates.
    elements[x + (y * 10)] = data
Here you have four input variables (elements, x, y, data), which I assume to be your input arrays (a, b, c, d). Yet in this operation you do not combine them; you overwrite an element of elements (index: x + 10y) with a new value (data).
Therefore, I do not understand your target output.
When I was asking for toy data, I had something like this in mind:
a = [[1,2]]
b = [[3,4]]
c = [[5,6]]
d = [[7,8]]
This would be such an easy example that you could easily say:
What I want is this:
res = [[[1,2],[3,4]],[[5,6],[7,8]]]
Then we could help you to find an answer.
Please, thus, provide more information about the operation that you want to conduct, either mathematically notated (such as x = a + b*c + d) or with toy data, so that we can deduce the function you are asking for.
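That said, if the goal is simply a 2x2 mosaic of the four (203, 135) arrays, np.block would produce the (406, 270) shape mentioned in the question. Judging by the toy coordinate ranges (a and d share the upper y range, c and b the lower one), one guessed arrangement is:
import numpy as np

res = np.block([[a, d],
                [c, b]])
print(res.shape)   # (406, 270)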

Calculate subplot adjustment

So I have some calculated data that should now be visualised. For each data element, I want to place a separate subplot so that the whole figure is as compact as possible; an example figure for five elements accompanied the original post.
Here's a prototype I came up with for an arbitrary element count:
import matplotlib.pyplot as plt
import numpy as np
import math

data = ...  # some list of pairs of numpy arrays, for x and y axes
size = len(data)
cols = math.floor(math.sqrt(size))
rows = math.ceil(size / cols)
f, diags = plt.subplots(rows, cols)
for (row, col), diag in np.ndenumerate(diags):
    dataIdx = row * cols + col
    if dataIdx < size:
        x = data[dataIdx][0]
        y = data[dataIdx][1]
        diag.scatter(x, y)
        diag.set_title('Regressor {}'.format(dataIdx + 1))
    else:  # discard empty subplots
        f.delaxes(diag)
f.show()
A short explanation: for compactness, I'm trying to arrange the plots as a square matrix if possible. If not, I add another row for the remaining diagrams (e.g. five elements give cols = 2, rows = 3). Then I iterate over the diagrams, calculate the corresponding position in the data, and plot its values. If no data element is found for a diagram, it is a remainder from the last row and can be discarded.
However, this is the code I would probably write in C++ or Java; the question is, what would be the pythonic way?
Also, what would be the best solution when iterating over the data instead of the diagrams? I could of course calculate a diagram's row/column from the element index the same way I did in the initial rows/columns calculation, but maybe there's a better way to do this...
Thanks in advance!
I would likely create the plot like this:
size = len(data)
cols = round(math.sqrt(size))
rows = cols
while rows * cols < size:
    rows += 1

f, ax_arr = plt.subplots(rows, cols)
ax_arr = ax_arr.reshape(-1)
for i in range(len(ax_arr)):
    if i >= size:
        ax_arr[i].axis('off')
        continue  # leftover axis: hide it and skip plotting
    x = data[i][0]
    y = data[i][1]
    ax_arr[i].scatter(x, y)
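A variant that iterates over the data instead of the axes (a sketch, reusing rows and cols from above): zip stops at the shorter sequence, and the leftover axes are switched off afterwards.
f, ax_arr = plt.subplots(rows, cols)
axes = ax_arr.reshape(-1)
for ax, (x, y) in zip(axes, data):
    ax.scatter(x, y)
for ax in axes[len(data):]:
    ax.axis('off')   # hide the unused subplots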
