Structure arrays for broadcasting numpy python - python

I have a dataframe which is in long format:
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1': [1], 'col2': [10]})
ratio = pd.Series([0.1, 0.70, 0.2])
# Expected Output
df_multiplied = pd.DataFrame({'col1': [0.1, 0.7, 0.2], 'col2': [1, 7, 2]})
My attempt was to convert it into numpy arrays and use np.tile:
np.tile(df.T.values, len(ratio)) * np.array(ratio).T
Is there any better way to do this?
Thank you!

Repeat the row n times, where n is the length of the ratio series, then multiply along the row axis by the ratio series:
>>> pd.concat([df]*ratio.shape[0], ignore_index=True).mul(ratio, axis='rows')
   col1  col2
0   0.1   1.0
1   0.7   7.0
2   0.2   2.0
Or you can implement similar logic with numpy: repeat the array n times, then multiply by the ratio values with an expanded dimension (note that the result below is 3-D, with shape (1, 3, 2)):
>>> np.repeat([df.values], ratio.shape[0], axis=1)*ratio.values[:,None]
array([[[0.1, 1. ],
[0.7, 7. ],
[0.2, 2. ]]])
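Since df holds a single row here, plain broadcasting may already be enough: a (3, 1) column of ratios multiplied by the (1, 2) row of values gives the (3, 2) result without any repeat or concat step. A minimal sketch, assuming the df and ratio from the question (df_multiplied is the name used for the expected output there):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1], 'col2': [10]})
ratio = pd.Series([0.1, 0.70, 0.2])

# (3, 1) column of ratios broadcast against the (1, 2) row of values -> (3, 2)
values = ratio.to_numpy()[:, None] * df.to_numpy()

# Wrap the 2-D result back into a DataFrame with the original column names
df_multiplied = pd.DataFrame(values, columns=df.columns)
print(df_multiplied)
```

This relies on df having exactly one row; with several rows you would have to decide which axis the ratios should pair with.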

Related

Efficient subsetting of pandas dataframe with indexes in 2d numpy array

I have a 2d (in future it may be 3d) array of indices. Let's say it looks like this:
[[1,1,1],
[1,2,2],
[2,2,3]]
And I have a pandas dataframe:
index, A, B
1, 0.1, 0.01
2, 0.2, 0.02
3, 0.3, 0.03
I want to get a numpy array (or pandas df) with the values from column A, looked up by the indices in the numpy array. So the result here would be:
[[0.1,0.1,0.1],
[0.1,0.2,0.2],
[0.2,0.2,0.3]]
I can do it with a loop to get pandas dataframe:
pd.DataFrame(df.A[val].values for val in array)
However, I'm looking for a more efficient way to do it. Is there a better way that lets me use the whole array of indices at once?
You can do:
df.loc[a.ravel(),'A'].values.reshape(a.shape)
Output:
array([[0.1, 0.1, 0.1],
[0.1, 0.2, 0.2],
[0.2, 0.2, 0.3]])
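A self-contained sketch of that lookup, assuming the index array is called a and the dataframe is indexed by the labels 1, 2, 3 as in the question. It also shows a variant (an addition here, not part of the answer) that converts labels to positions once with Index.get_indexer and then uses plain numpy fancy indexing; this assumes the index labels are unique and all present.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0.1, 0.2, 0.3], 'B': [0.01, 0.02, 0.03]},
                  index=[1, 2, 3])
a = np.array([[1, 1, 1],
              [1, 2, 2],
              [2, 2, 3]])

# Label-based lookup on the flattened labels, then restore the original shape
by_label = df.loc[a.ravel(), 'A'].to_numpy().reshape(a.shape)

# Variant: map labels to integer positions, then index the raw column array
positions = df.index.get_indexer(a.ravel()).reshape(a.shape)
by_position = df['A'].to_numpy()[positions]
```

Both forms use a.shape only at the reshape step, so the same code should carry over to a 3d index array.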

Multiplying by pattern matching

I have a matrix of the following format:
matrix = np.array([[1, 2, 3, np.nan],
                   [1, np.nan, 3, 4],
                   [np.nan, 2, 3, np.nan]])
and coefficients I want to selectively multiply element-wise with my matrix:
coefficients = np.array([[0.5, np.nan, 0.2, 0.3],
                         [0.3, 0.3, 0.2, np.nan],
                         [np.nan, 0.2, 0.1, np.nan]])
In this case, I would want the first row in matrix to be multiplied with the second row in coefficients, while the second row in matrix would be multiplied with the first row in coefficients. In short, I want to select the row in coefficients that matches the row in matrix in terms of where the np.nan values are located.
The location of np.nan values will be different for each row in coefficients, as they describe the coefficients for different cases of data availability.
Is there a quick way to do this, that doesn't require writing if-statements for all possible cases?
Approach #1
A quick way would be with NumPy broadcasting -
# Mask of NaNs
mask1 = np.isnan(matrix)
mask2 = np.isnan(coefficients)
# Perform comparison between each row of mask1 against every row of mask2
# leading to a 3D array. Look for all-matching ones along the last axis.
# These are the ones that show the row matches between the two input arrays -
# matrix and coefficients. Then, we find the corresponding matching
# indices that give us the pairs of matches between those two arrays
r,c = np.nonzero((mask1[:,None] == mask2).all(-1))
# Index into arrays with those indices and perform elementwise multiplication
out = matrix[r] * coefficients[c]
Output for given sample data -
In [40]: out
Out[40]:
array([[ 0.3, 0.6, 0.6, nan],
[ 0.5, nan, 0.6, 1.2],
[ nan, 0.4, 0.3, nan]])
Approach #2
For performance, reduce each row of the NaN mask to its decimal equivalent, then create a storage array into which we store the rows of matrix, and multiply it by the rows of coefficients, both indexed by those decimal equivalents -
R = 2**np.arange(matrix.shape[1])
idx1 = mask1.dot(R)
idx2 = mask2.dot(R)
A = np.empty((idx1.max()+1, matrix.shape[1]))
A[idx1] = matrix
A[idx2] *= coefficients
out = A[idx1]
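To check that the two approaches agree, here is a self-contained sketch on the sample data (with the arrays written as nested lists so that np.array accepts them). Two details are assumptions added here rather than part of the answer: the storage array A is filled with NaN instead of being left uninitialised, and it is sized from the larger of the two index sets so that a coefficient pattern absent from matrix cannot index out of bounds.

```python
import numpy as np

matrix = np.array([[1, 2, 3, np.nan],
                   [1, np.nan, 3, 4],
                   [np.nan, 2, 3, np.nan]])
coefficients = np.array([[0.5, np.nan, 0.2, 0.3],
                         [0.3, 0.3, 0.2, np.nan],
                         [np.nan, 0.2, 0.1, np.nan]])

mask1 = np.isnan(matrix)
mask2 = np.isnan(coefficients)

# Approach #1: compare every mask row of matrix with every mask row of
# coefficients; (r, c) are the pairs of rows whose NaN patterns match
r, c = np.nonzero((mask1[:, None] == mask2).all(-1))
out1 = matrix[r] * coefficients[c]

# Approach #2: encode each NaN pattern as an integer (one bit per column)
R = 2 ** np.arange(matrix.shape[1])
idx1 = mask1.dot(R)
idx2 = mask2.dot(R)

# Storage array indexed by pattern code (NaN-filled: an assumption, see above)
A = np.full((max(idx1.max(), idx2.max()) + 1, matrix.shape[1]), np.nan)
A[idx1] = matrix          # store each matrix row under its pattern code
A[idx2] *= coefficients   # multiply by the coefficients with the same code
out2 = A[idx1]            # read back in the original row order

print(out1)
```

Both approaches assume that each NaN pattern occurs at most once per array, as in the sample; with repeated patterns in matrix, Approach #2 would overwrite earlier rows in A.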

Get pairs of variables from correlation matrix that minimize the sum of correlations

Let's say I get a correlation matrix from a dataframe like here.
Among all the variables, I want to select X of them such that the total sum of pairwise correlations within the selected set is minimal.
How can I do this?
Here is a working, though not very efficient, solution. It picks 3 out of 4 features, and extends to e.g. 6 out of 10 if you change n_features from 3 to 6:
import pandas as pd
foo = pd.DataFrame({'vars': ['col_a', 'col_b', 'col_c', 'col_d'],
'col_a': [1, 0.9, 0.04, 0.03],
'col_b': [0.9,1,0.05,0.03],
'col_c': [0.04, 0.05, 1, -0.04],
'col_d': [0.03, 0.03, -0.04,1]})
import numpy as np
import itertools
n_features = 3
test_cols = ['col_a', 'col_b', 'col_c', 'col_d']
sum_l = {}
for l in itertools.combinations(test_cols, n_features):
    sum_l2 = 0
    # add up the absolute correlation of every pair inside this combination
    for l2 in itertools.combinations(l, 2):
        sum_l2 += np.abs(foo.query('vars == @l2[0]')[l2[1]].values[0])
    sum_l[l] = sum_l2
print(sum_l)
print(min(sum_l, key=sum_l.get))
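The same brute-force search can be sketched without the per-pair query calls by indexing the absolute correlation matrix directly. This is a variant under assumptions, not the answer's code: the correlations are held in a square DataFrame corr (rows and columns labelled alike), and pairwise_sum is a helper name made up for this example.

```python
import itertools

import numpy as np
import pandas as pd

cols = ['col_a', 'col_b', 'col_c', 'col_d']
corr = pd.DataFrame([[1, 0.9, 0.04, 0.03],
                     [0.9, 1, 0.05, 0.03],
                     [0.04, 0.05, 1, -0.04],
                     [0.03, 0.03, -0.04, 1]],
                    index=cols, columns=cols)
n_features = 3

abs_corr = corr.abs().to_numpy()


def pairwise_sum(idx):
    """Sum of |correlation| over all pairs among the positions in idx."""
    sub = abs_corr[np.ix_(idx, idx)]   # sub-matrix for this combination
    return np.triu(sub, k=1).sum()     # upper triangle: each pair once, no diagonal


# Exhaustive search over all combinations of n_features column positions
best_idx = min(itertools.combinations(range(len(cols)), n_features),
               key=pairwise_sum)
best = tuple(cols[i] for i in best_idx)
print(best, pairwise_sum(best_idx))
```

The search is still exhaustive, so the number of combinations grows quickly with the number of variables; only the cost per combination is reduced.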

Pandas plot with errorbar: style does not apply

I have Pandas (version 0.14.1) DataFrame object like this
import pandas as pd
df = pd.DataFrame(zip([1, 2, 3, 4, 5],
[0.1, 0.3, 0.1, 0.2, 0.4]),
columns=['y', 'dy'])
It returns
   y   dy
0  1  0.1
1  2  0.3
2  3  0.1
3  4  0.2
4  5  0.4
where the first column is the value and the second is the error.
First case: I want to plot the y-values
df['y'].plot(style="ro-")
Second case: I want to add vertical error bars dy to the y-values
df['y'].plot(style="ro-", yerr=df['dy'])
So, if I add the yerr or xerr parameter to the plot method, it ignores style.
Is this a pandas feature or a bug?
As TomAugspurger pointed out, it is a known issue. However, it has an easy workaround in most cases: use the fmt keyword instead of the style keyword to specify shortcut style options.
import pandas as pd
df = pd.DataFrame(zip([1, 2, 3, 4, 5],
[0.1, 0.3, 0.1, 0.2, 0.4]),
columns=['y', 'dy'])
df['y'].plot(fmt='ro-', yerr=df['dy'], grid='on')
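If the pandas wrapper does not forward the style in a given version, the same plot can be drawn with matplotlib directly, where fmt is a documented parameter of Axes.errorbar. A minimal sketch under that assumption; it sidesteps the pandas plot method rather than fixing it, and the Agg backend is selected only so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'y': [1, 2, 3, 4, 5],
                   'dy': [0.1, 0.3, 0.1, 0.2, 0.4]})

fig, ax = plt.subplots()

# fmt takes the same shortcut string as style: red, circle markers, solid line
container = ax.errorbar(df.index, df['y'], yerr=df['dy'], fmt='ro-')
ax.grid(True)
```

The returned container holds the data line first, so its marker and line style can be inspected or changed afterwards.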

Multiplying Rows and Columns of Python Sparse Matrix by elements in an Array

I have a numpy array such as:
array = [0.2, 0.3, 0.4]
(this vector is actually size 300k dense, I'm just illustrating with simple examples)
and a sparse symmetric matrix created using Scipy such as follows:
M = [[0, 1, 2]
[1, 0, 1]
[2, 1, 0]]
(represented as dense just to illustrate; in my real problem it's a (300k x 300k) sparse matrix)
Is it possible to multiply all rows by the elements in array and then make the same operation regarding the columns?
This would result first in :
M = [[0 * 0.2, 1 * 0.2, 2 * 0.2]
[1 * 0.3, 0 * 0.3, 1 * 0.3]
[2 * 0.4, 1 * 0.4, 0 * 0.4]]
(rows are being multiplied by the elements in array)
M = [[0, 0.2, 0.4]
[0.3, 0, 0.3]
[0.8, 0.4, 0]]
And then the columns are multiplied:
M = [[0 * 0.2, 0.2 * 0.3, 0.4 * 0.4]
[0.3 * 0.2, 0 * 0.3, 0.3 * 0.4]
[0.8 * 0.2, 0.4 * 0.3, 0 * 0.4]]
Resulting finally in:
M = [[0, 0.06, 0.16]
[0.06, 0, 0.12]
[0.16, 0.12, 0]]
I've tried applying the solution I found in this thread, but it didn't work: I multiplied the data of M by the elements in array as suggested, then transposed the matrix and applied the same operation, but the result wasn't correct, and I still couldn't understand why!
Just to point this out, the matrix I'll be running these operations on is fairly big (it has 20 million non-zero elements), so efficiency is very important!
I appreciate your help!
Edit:
Bitwise solution worked very well. It took 1.72 s to compute this operation here, which is fine for our work. Thanks!
In general you want to avoid loops and use matrix operations for speed and efficiency. In this case the solution is simple linear algebra, or more specifically matrix multiplication.
To multiply the columns of M by the array A, multiply M*diag(A). To multiply the rows of M by A, multiply diag(A)*M. To do both: diag(A)*M*diag(A), which can be accomplished as follows (where a is the diagonal matrix diag(A) and m is M):
numpy.dot(numpy.dot(a, m), a)
diag(A) here is a matrix that is all zeros except for A on its diagonal. There are functions that create this matrix easily (e.g. numpy.diag() and scipy.sparse.diags()).
I expect this to run very fast.
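For a sparse M, the diagonal matrix should itself be sparse, since a dense 300k x 300k diag(A) would not fit in memory. A minimal sketch of diag(A)*M*diag(A) using scipy.sparse.diags and the @ operator, on the small example from the question; the names D and result are made up here.

```python
import numpy as np
from scipy import sparse

array = np.array([0.2, 0.3, 0.4])
M = sparse.csr_matrix(np.array([[0, 1, 2],
                                [1, 0, 1],
                                [2, 1, 0]], dtype=float))

# Sparse diagonal matrix with `array` on the main diagonal
D = sparse.diags(array)

# D @ M scales row i by array[i]; (...) @ D scales column j by array[j]
result = D @ M @ D
print(result.toarray())
```

Only the stored non-zero entries of M take part in the products, so the cost scales with the number of non-zeros rather than with the full matrix size.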
The following should work:
[[x*array[i]*array[j] for j, x in enumerate(row)] for i, row in enumerate(M)]
Example:
>>> array = [0.2, 0.3, 0.4]
>>> M = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
>>> [[x*array[i]*array[j] for j, x in enumerate(row)] for i, row in enumerate(M)]
[[0.0, 0.059999999999999998, 0.16000000000000003], [0.059999999999999998, 0.0, 0.12], [0.16000000000000003, 0.12, 0.0]]
Values are slightly off due to limitations on floating point arithmetic. Use the decimal module if the rounding error is unacceptable.
I use this combination:
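def multiply(matrix, vector, axis):
    if axis == 1:
        # In-place row scaling; assumes CSR format, where data is stored row by row
        val = np.repeat(vector, matrix.getnnz(axis=1))
        matrix.data *= val
    else:
        # Column scaling; returns a new matrix
        matrix = matrix.multiply(vector)
    return matrix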
When the axis is 1 (multiply by rows), I replicate the second approach of this solution, and when the axis is 0 (multiply by columns) I use multiply.
The in-place result (axis=1) is more efficient.
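Both scalings can also be done in place on a CSR matrix, because CSR keeps the column index of every stored value in its indices attribute. A sketch on the question's small example, assuming M is in CSR format (the column step is an addition here, not part of the answer above):

```python
import numpy as np
from scipy import sparse

vector = np.array([0.2, 0.3, 0.4])
M = sparse.csr_matrix(np.array([[0, 1, 2],
                                [1, 0, 1],
                                [2, 1, 0]], dtype=float))

# Rows: CSR stores values row by row, so repeat each row's factor once per
# stored value in that row
M.data *= np.repeat(vector, M.getnnz(axis=1))

# Columns: M.indices gives the column of each stored value, so look up the
# matching factor for every entry
M.data *= vector[M.indices]

print(M.toarray())
```

Neither step allocates a new sparse matrix; only temporary arrays the size of M.data are created.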
