Efficient subsetting of pandas dataframe with indexes in 2d numpy array - python

I have a 2D array of indices (in future it may be 3D). Let's say it looks like this:
[[1,1,1],
[1,2,2],
[2,2,3]]
And I have a pandas dataframe:
index, A, B
1, 0.1, 0.01
2, 0.2, 0.02
3, 0.3, 0.03
I want to get a numpy array (or pandas df) with the values from column A, selected according to the numpy array. So the result here would be:
[[0.1,0.1,0.1],
[0.1,0.2,0.2],
[0.2,0.2,0.3]]
I can do it with a loop to get pandas dataframe:
pd.DataFrame(df.A[val].values for val in array)
However, I'm looking for a more efficient way to do it. Is there a better way that lets me use the whole array of indices at once?

You can do:
df.loc[a.ravel(),'A'].values.reshape(a.shape)
Output:
array([[0.1, 0.1, 0.1],
[0.1, 0.2, 0.2],
[0.2, 0.2, 0.3]])
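A minimal sketch of the same lookup, plus a purely positional variant. The positional variant is my own assumption rather than part of the answer: it uses `np.searchsorted` and only works when the DataFrame index is sorted and contains every label that appears in the array. The names `df` and `a` mirror the ones used above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [0.1, 0.2, 0.3], "B": [0.01, 0.02, 0.03]}, index=[1, 2, 3])
a = np.array([[1, 1, 1], [1, 2, 2], [2, 2, 3]])

# Label-based lookup from the answer above: flatten, look up, restore the shape.
by_label = df.loc[a.ravel(), "A"].to_numpy().reshape(a.shape)

# Positional variant (assumes a sorted index that contains every label in `a`):
# translate labels to positions, then index the column with the whole array at once.
pos = np.searchsorted(df.index.to_numpy(), a)
by_position = df["A"].to_numpy()[pos]
```

Both variants keep the shape of `a`, so they should carry over to a 3D index array unchanged.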

Related

Structure arrays for broadcasting numpy python

I have a dataframe which is in long format
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1': [1], 'col2': [10]})
ratio = pd.Series([0.1, 0.70, 0.2])
# Expected Output
df_multiplied = pd.DataFrame({'col1': [0.1, 0.7, 0.2], 'col2': [1, 7, 2]})
My attempt was to convert it into numpy arrays and use np.tile:
np.tile(df.T.values, len(ratio)) * np.array(ratio).T
Is there any better way to do this?
Thank you!
Repeat the row n times, where n is the length of the ratio series, then multiply along the row axis by the ratio series:
>>> pd.concat([df]*ratio.shape[0], ignore_index=True).mul(ratio, axis='rows')
col1 col2
0 0.1 1.0
1 0.7 7.0
2 0.2 2.0
Or, you can implement similar logic with numpy, repeat the array n times then multiply by ratio values with expanded dimension:
>>> np.repeat([df.values], ratio.shape[0], axis=1)*ratio.values[:,None]
array([[[0.1, 1. ],
[0.7, 7. ],
[0.2, 2. ]]])
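Since `df` has a single row here, plain broadcasting may be enough without repeating anything first. The following is a sketch under that assumption (one row in `df`); `scaled` is a name introduced only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [1], "col2": [10]})
ratio = pd.Series([0.1, 0.70, 0.2])

# A (1, 2) row times a (3, 1) column broadcasts to a (3, 2) result.
scaled = df.to_numpy() * ratio.to_numpy()[:, None]

# Wrap the result back into a DataFrame with the original column names.
df_multiplied = pd.DataFrame(scaled, columns=df.columns)
```

Unlike the `np.repeat` version above, this yields a 2D array directly.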

Best way converting data in PANDAS DataFrame to matrix in Python

I found one thread about converting a matrix to a pandas DataFrame. However, I would like to do the opposite: I have a pandas DataFrame with time series data of this structure:
row time stamp, batch, value
1, 1, 0.1
2, 1, 0.2
3, 1, 0.3
4, 1, 0.3
5, 2, 0.25
6, 2, 0.32
7, 2, 0.2
8, 2, 0.1
...
What I would like to have is a matrix of values with one row belonging to one batch:
[[0.1, 0.2, 0.3, 0.3],
[0.25, 0.32, 0.2, 0.1],
...]
which I want to plot as a heatmap using matplotlib or similar.
Any suggestion?
What you can try is to first group by the desired index:
g = df.groupby("batch")
Then aggregate each group using the list constructor.
The result can then be converted to an array using the .values property (the older .as_matrix() method was deprecated and is no longer available in pandas 1.0 and later).
mtr = g.aggregate(list).values
One downside of this method is that it creates an array of lists instead of a regular 2D array, even when every batch has the same number of values.
Alternatively, if you know that you get exactly 4 values for every unique value of batch you can just use the matrix directly.
df = df.sort_values("batch", kind="stable") # stable sort keeps the time order within each batch
mtr = df["value"].values # take only the value column, not the batch column
mtr = mtr.reshape(-1, 4) # Only works if you have exactly 4 values for each batch
Try using crosstab from pandas, pd.crosstab(). You will have to choose a suitable aggfunc.
https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.crosstab.html
Then convert the result with .values (or .as_matrix() on old pandas versions).
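Another possible route, sketched here as an assumption rather than something the answers propose, is to number the observations within each batch and pivot. The column name `step` is hypothetical, and the sample data mirrors the question.

```python
import pandas as pd

df = pd.DataFrame({
    "batch": [1, 1, 1, 1, 2, 2, 2, 2],
    "value": [0.1, 0.2, 0.3, 0.3, 0.25, 0.32, 0.2, 0.1],
})

# Number the observations inside each batch: 0, 1, 2, ... in their current order.
df["step"] = df.groupby("batch").cumcount()

# One row per batch, one column per step; shorter batches are padded with NaN.
mtr = df.pivot(index="batch", columns="step", values="value").to_numpy()
```

The resulting 2D array can be handed to a heatmap function such as matplotlib's `imshow`, and it does not require every batch to have exactly 4 values.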

Multiplying by pattern matching

I have a matrix of the following format:
matrix = np.array([[1, 2, 3, np.nan],
[1, np.nan, 3, 4],
[np.nan, 2, 3, np.nan]])
and coefficients I want to selectively multiply element-wise with my matrix:
coefficients = np.array([[0.5, np.nan, 0.2, 0.3],
[0.3, 0.3, 0.2, np.nan],
[np.nan, 0.2, 0.1, np.nan]])
In this case, I would want the first row in matrix to be multiplied with the second row in coefficients, while the second row in matrix would be multiplied with the first row in coefficients. In short, I want to select the row in coefficients that matches row in matrix in terms of where np.nan values are located.
The location of np.nan values will be different for each row in coefficients, as they describe the coefficients for different cases of data availability.
Is there a quick way to do this, that doesn't require writing if-statements for all possible cases?
Approach #1
A quick way would be with NumPy broadcasting -
# Mask of NaNs
mask1 = np.isnan(matrix)
mask2 = np.isnan(coefficients)
# Compare each row of mask1 against every row of mask2, which gives
# a 3D array. Look for all-matching ones along the last axis.
# These mark the row matches between the two input arrays,
# matrix and coefficients. Then find the corresponding matching
# indices, which give us the pairs of matching rows between those two arrays
r,c = np.nonzero((mask1[:,None] == mask2).all(-1))
# Index into arrays with those indices and perform elementwise multiplication
out = matrix[r] * coefficients[c]
Output for given sample data -
In [40]: out
Out[40]:
array([[ 0.3, 0.6, 0.6, nan],
[ 0.5, nan, 0.6, 1.2],
[ nan, 0.4, 0.3, nan]])
Approach #2
For performance, reduce each row of the NaN mask to its decimal equivalent, then create a storage array indexed by those decimal equivalents: store the rows of matrix in it, multiply the rows of coefficients into it, and read the result back out -
R = 2**np.arange(matrix.shape[1])
idx1 = mask1.dot(R)
idx2 = mask2.dot(R)
A = np.empty((idx1.max()+1, matrix.shape[1]))
A[idx1] = matrix
A[idx2] *= coefficients
out = A[idx1]
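A variant of the same idea, sketched under my own assumptions: the lookup table stores the coefficients rather than the matrix rows, so several matrix rows that share a NaN pattern each keep their own values. The names `weights`, `key_matrix`, `key_coeff` and `table` are introduced for illustration only.

```python
import numpy as np

matrix = np.array([[1, 2, 3, np.nan],
                   [1, np.nan, 3, 4],
                   [np.nan, 2, 3, np.nan]])
coefficients = np.array([[0.5, np.nan, 0.2, 0.3],
                         [0.3, 0.3, 0.2, np.nan],
                         [np.nan, 0.2, 0.1, np.nan]])

# Encode each row's NaN pattern as an integer (one bit per column).
weights = 2 ** np.arange(matrix.shape[1])
key_matrix = np.isnan(matrix).dot(weights)
key_coeff = np.isnan(coefficients).dot(weights)

# Lookup table: row k holds the coefficients for NaN pattern k (all NaN if none exists).
table = np.full((2 ** matrix.shape[1], matrix.shape[1]), np.nan)
table[key_coeff] = coefficients

# Pick the matching coefficient row for every matrix row and multiply.
out = matrix * table[key_matrix]
```

The table has 2**n rows for n columns, so this is only practical for a modest number of columns.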

Get the index of median value in array containing Nans

How can I get the index of the median value for an array which contains NaNs?
For example, I have the array of values [NaN, 2, 5, NaN, 4, NaN, 3, 1] with a corresponding array of errors on those values [np.nan, 0.1, 0.2, np.nan, 0.1, np.nan, 0.4, 0.3]. Then the median is 3, while the error is 0.4.
Is there a simple way to do this?
EDIT: I edited the error array to imply a more realistic situation. And Yes, I am using numpy.
It's not really clear how you intend to meaningfully extract the error from the median, but if you do happen to have an array such that the median is one of its entries, and the corresponding error array is defined at the corresponding index, and there aren't other entries with the same value as the median, and probably several other disclaimers, then you can do the following:
a = np.array([np.nan,2,5,np.nan, 4,np.nan,3,1])
aerr = np.array([np.nan, 0.1, 0.2, np.nan, 0.1, np.nan, 0.4, 0.3])
# median, ignoring NaNs
amedian = np.median(a[np.isfinite(a)])
# find the index of the closest value to the median in a
idx = np.nanargmin(np.abs(a-amedian))
# this is the corresponding "error"
aerr[idx]
EDIT: as @DSM points out, if you have NumPy 1.9 or above, you can simplify the calculation of amedian to amedian = np.nanmedian(a).
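Some of the caveats above (duplicates of the median, an even number of valid entries) can be sidestepped by sorting the valid positions instead of comparing values. This is a sketch, not the answerer's method; the helper name `nanmedian_index` is hypothetical, and returning the lower of the two middle elements for an even count is an assumption of this sketch.

```python
import numpy as np

def nanmedian_index(values):
    """Return the index of the median of `values`, ignoring NaNs.

    With an even number of valid entries, the lower middle element is used.
    """
    values = np.asarray(values, dtype=float)
    valid = np.flatnonzero(~np.isnan(values))  # positions of the non-NaN entries
    if valid.size == 0:
        raise ValueError("no non-NaN values")
    # Positions of the valid entries, ordered by their value.
    order = valid[np.argsort(values[valid], kind="stable")]
    return order[(order.size - 1) // 2]

a = np.array([np.nan, 2, 5, np.nan, 4, np.nan, 3, 1])
aerr = np.array([np.nan, 0.1, 0.2, np.nan, 0.1, np.nan, 0.4, 0.3])

idx = nanmedian_index(a)
median_value, median_error = a[idx], aerr[idx]
```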
numpy has everything you need:
values = np.array([np.nan, 2, 5, np.nan, 4, np.nan, 3, 1])
errors = np.array([np.nan, 0.1, 0.2, np.nan, 0.1, np.nan, 0.4, 0.3])
# filter
filtered = values[~np.isnan(values)]
# find median
median = np.median(filtered)
# find indexes
indexes = np.where(values == median)[0]
# find errors
errors[indexes] # array([ 0.4])
Let's say your list is named "a"; then you can use this code to build a masked array that hides the NaNs and take the median with np.ma.median():
import numpy as np
a = [np.nan, 2, 5, np.nan, 4, np.nan, 3, 1]
am = np.ma.masked_array(a, np.isnan(a))  # mask every NaN entry
np.ma.median(am)
You can do the same for the errors as well.

Multiplying Rows and Columns of Python Sparse Matrix by elements in an Array

I have a numpy array such as:
array = [0.2, 0.3, 0.4]
(this vector is actually size 300k dense, I'm just illustrating with simple examples)
and a sparse symmetric matrix created using Scipy such as follows:
M = [[0, 1, 2]
[1, 0, 1]
[2, 1, 0]]
(represented as dense just to illustrate; in my real problem it's a (300k x 300k) sparse matrix)
Is it possible to multiply all rows by the elements in array and then make the same operation regarding the columns?
This would result first in :
M = [[0 * 0.2, 1 * 0.2, 2 * 0.2]
[1 * 0.3, 0 * 0.3, 1 * 0.3]
[2 * 0.4, 1 * 0.4, 0 * 0.4]]
(rows are being multiplied by the elements in array)
M = [[0, 0.2, 0.4]
[0.3, 0, 0.3]
[0.8, 0.4, 0]]
And then the columns are multiplied:
M = [[0 * 0.2, 0.2 * 0.3, 0.4 * 0.4]
[0.3 * 0.2, 0 * 0.3, 0.3 * 0.4]
[0.8 * 0.2, 0.4 * 0.3, 0 * 0.4]]
Resulting finally in:
M = [[0, 0.06, 0.16]
[0.06, 0, 0.12]
[0.16, 0.12, 0]]
I've tried applying the solution I found in this thread, but it didn't work; I multiplied the data of M by the elements in array as suggested, then transposed the matrix and applied the same operation, but the result wasn't correct, and I still couldn't understand why!
Just to point this out, the matrix I'll be running these operations on is fairly big: it has 20 million non-zero elements, so efficiency is very important!
I appreciate your help!
Edit:
Bitwise solution worked very well. Here it took 1.72 s to compute this operation but that's ok to our work. Tnx!
In general you want to avoid loops and use matrix operations for speed and efficiency. In this case the solution is simple linear algebra, or more specifically matrix multiplication.
To multiply the columns of M by the array A, compute M*diag(A). To multiply the rows of M by A, compute diag(A)*M. To do both, compute diag(A)*M*diag(A), which for dense arrays can be written as:
d = numpy.diag(a)
numpy.dot(numpy.dot(d, m), d)
diag(A) here is a matrix that is all zeros except for A on its diagonal. You can create this matrix easily with numpy.diag() for dense arrays or scipy.sparse.diags() for sparse ones.
I expect this to run very fast.
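For the sparse case described in the question, the same product could be sketched with `scipy.sparse.diags`, so that nothing is ever converted to a dense 300k x 300k matrix. The small `M` and `a` below are the illustrative values from the question; `D` and `scaled` are names introduced here.

```python
import numpy as np
from scipy import sparse

a = np.array([0.2, 0.3, 0.4])
M = sparse.csr_matrix([[0, 1, 2], [1, 0, 1], [2, 1, 0]], dtype=float)

# Sparse diagonal matrix with `a` on its diagonal.
D = sparse.diags(a)

# D @ M scales the rows of M, (...) @ D scales the columns; the result stays sparse.
scaled = D @ M @ D
```

Entry (i, j) of `scaled` is `a[i] * M[i, j] * a[j]`, which matches the final matrix worked out in the question.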
The following should work:
[[x*array[i]*array[j] for j, x in enumerate(row)] for i, row in enumerate(M)]
Example:
>>> array = [0.2, 0.3, 0.4]
>>> M = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
>>> [[x*array[i]*array[j] for j, x in enumerate(row)] for i, row in enumerate(M)]
[[0.0, 0.059999999999999998, 0.16000000000000003], [0.059999999999999998, 0.0, 0.12], [0.16000000000000003, 0.12, 0.0]]
Values are slightly off due to limitations on floating point arithmetic. Use the decimal module if the rounding error is unacceptable.
I use this combination:
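def multiply(matrix, vector, axis):
    # matrix is assumed to be a scipy.sparse matrix in CSR format
    if axis == 1:
        # scale each row in place: repeat each factor once per stored entry in that row
        val = np.repeat(vector, matrix.getnnz(axis=1))
        matrix.data *= val
    else:
        # scale each column: a 1-D vector broadcasts across the columns
        matrix = matrix.multiply(vector)
    return matrix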
When the axis is 1 (multiply by rows), I replicate the second approach of this solution, and when the axis is 0 (multiply by columns) I use multiply.
The in-place result (axis=1) is more efficient.
