Collecting elements from different DataFrames into arrays - python

I like using nested data structures, and now I'm trying to understand how to use Pandas.
Here is a toy model:
import numpy as np
import pandas as pd

a = pd.DataFrame({'x': [1, 2], 'y': [10, 20]})
b = pd.DataFrame({'x': [3, 4], 'y': [30, 40]})
c = [a, b]
now I would like to get:
sol=np.array([[[1],[3]],[[2],[4]]])
I have an idea to get both sol[0] and sol[1] as:
s0 = np.array([item[['x']].iloc[0] for item in c])
s1 = np.array([item[['x']].iloc[1] for item in c])
but to get sol I would have to loop over the index, and I don't think that is really Pythonic...

It looks like you want just the x columns from a and b. You can concatenate two Series (or DataFrames) into a new DataFrame using pd.concat:
In [132]: pd.concat([a['x'], b['x']], axis=1)
Out[132]:
   x  x
0  1  3
1  2  4

[2 rows x 2 columns]
Now, if you want a numpy array, use the values attribute:
In [133]: pd.concat([a['x'], b['x']], axis=1).values
Out[133]:
array([[1, 3],
       [2, 4]], dtype=int64)
And if you want a numpy array with the same shape as sol, then use the reshape method:
In [134]: pd.concat([a['x'], b['x']], axis=1).values.reshape(2,2,1)
Out[134]:
array([[[1],
        [3]],

       [[2],
        [4]]], dtype=int64)
In [136]: np.allclose(pd.concat([a['x'], b['x']], axis=1).values.reshape(2,2,1), sol)
Out[136]: True
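If you'd rather skip the intermediate DataFrame, here is a minimal alternative sketch (my own suggestion, not part of the answer above) that stacks the 'x' columns straight from the list c:
import numpy as np

# Stack the 'x' column of each frame along a new second axis, then add a
# trailing length-1 axis to match sol's (2, 2, 1) shape.
out = np.stack([df['x'].values for df in c], axis=1)[..., None]

np.array_equal(out, sol)  # True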

Related

Get values of pandas series from an array of index locations

I have a 2-d array of index locations into a pandas Series. I would like to create a 2-d array of the values from the Series that correspond to those indices.
For example:
import pandas as pd
import numpy as np
A = pd.Series(data=[1,2,3,4,5])
idx = np.array([[0,2,3],[2,3,1]])
Would like to return:
B = np.array([[1,3,4],[3,4,2]])
I know I could do this as a loop:
B = np.zeros((2, 3))
for i in [0, 1]:
    B[i, :] = A[idx[i]]
However, in practice I need to do this repeatedly, so I would like to broadcast the index locations directly. Pandas is not necessary; I'm happy to do it all in numpy if that is easier.
Something like this might work:
A[idx.flatten()].values.reshape(idx.shape)
A[idx] gives a "Cannot index with multidimensional key" error.
In [190]: A = pd.Series(data=[1,2,3,4,5])
...: idx = np.array([[0,2,3],[2,3,1]])
But the 1-d array derived from the Series can be indexed this way:
In [191]: A.values
Out[191]: array([1, 2, 3, 4, 5])
In [192]: A.values[idx]
Out[192]:
array([[1, 3, 4],
       [3, 4, 2]])
numpy has no problem returning an array whose shape matches idx.
Indexing the Series like this, by contrast, returns a Series, which is by definition 1-d:
In [194]: A[idx.ravel()]
Out[194]:
0 1
2 3
3 4
2 3
3 4
1 2
dtype: int64
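Putting the two approaches side by side, here is a minimal sketch (reusing A, idx and B from the question) that yields the desired 2-d result either way:
import numpy as np
import pandas as pd

A = pd.Series(data=[1, 2, 3, 4, 5])
idx = np.array([[0, 2, 3], [2, 3, 1]])

# Index the underlying ndarray directly; the result keeps idx's shape.
B = A.values[idx]

# Or index the Series with the flattened keys and reshape afterwards,
# as suggested in the question.
B_alt = A[idx.ravel()].values.reshape(idx.shape)

np.array_equal(B, B_alt)  # True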

How can I get a subcovariance from a covariance matrix in python

I have a covariance matrix (as a pandas DataFrame) in python as follows:
   a   b    c
a  1   2    3
b  2  10    4
c  3   4  100
And I want to dynamically select only a subset of the covariance matrix. For example, the subset for a and c would look like:
   a    c
a  1    3
c  3  100
Is there any function that can select this subset?
Thank you!
If your covariance matrix is a numpy array like this:
import numpy as np

cov = np.array([[1, 2, 3],
                [2, 10, 4],
                [3, 4, 100]])
Then you can get the desired submatrix by advanced indexing:
subset = [0, 2] # a, c
cov[np.ix_(subset, subset)]
# array([[  1,   3],
#        [  3, 100]])
Edit:
If your covariance matrix is a pandas DataFrame (e.g. obtained as cov = df.cov() for some dataframe df with columns 'a', 'b', 'c', ...), you can get the subset for 'a' and 'c' like this:
cov.loc[['a','c'], ['a','c']]
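For completeness, a small end-to-end sketch; the data below is made up purely for illustration (only the column names match the question), and it shows both the label-based and the position-based selection:
import numpy as np
import pandas as pd

# Illustrative data only; any DataFrame with columns 'a', 'b', 'c' will do.
df = pd.DataFrame(np.random.randn(100, 3), columns=['a', 'b', 'c'])
cov = df.cov()

# Label-based sub-covariance for 'a' and 'c'
sub = cov.loc[['a', 'c'], ['a', 'c']]

# Position-based equivalent on the underlying array
sub_arr = cov.values[np.ix_([0, 2], [0, 2])]

np.allclose(sub.values, sub_arr)  # True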

numpy get row index where elements in certain columns are zero

I want to find the indices of rows based on criteria over certain columns.
So, something like:
import numpy as np
x = np.random.rand(4, 5)
x[2, 2] = 0
x[2, 3] = 0
x[3, 1] = 0
x[1, 3] = 0
Now, I want to get the indices of the rows where either column 3 or column 4 is zero. How can one do that with numpy? Do I need to make multiple calls to nonzero for each column and combine the resulting indices using a set or something like that?
Using np.where, the first array within the returned tuple is the row index:
np.where(x[:,[3,4]]==0)
Out[79]: (array([1, 2], dtype=int64), array([0, 0], dtype=int64))
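If you only want each matching row index once, one option (my addition, not part of the answer above) is to reduce over the selected columns with any before calling np.where:
import numpy as np

x = np.random.rand(4, 5)
x[2, 2] = 0
x[2, 3] = 0
x[3, 1] = 0
x[1, 3] = 0

# True for rows where column 3 or column 4 contains a zero
mask = (x[:, [3, 4]] == 0).any(axis=1)
rows = np.where(mask)[0]
# rows -> array([1, 2])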

Map index of numpy matrix

How should I map indices of a numpy matrix?
For example:
import numpy as np
mx = np.matrix([[5,6,2],[3,3,7],[0,1,6]])
The row/column indices are 0, 1, 2.
So:
>>> mx[0,0]
5
Let's say I need to map these indices, converting 0, 1, 2 into, e.g., 10, 'A', 'B', so that:
mx[10,10] #returns 5
mx[10,'A'] #returns 6 and so on..
I can just set a dict and use it to access the elements, but I would like to know if it is possible to do something like what I just described.
I would suggest using a pandas DataFrame, with the new mappings as the index and columns for row and column indexing respectively, for ease of indexing. It lets us select a single element, or an entire row or column, with the familiar colon operator.
Consider a generic, non-square 4x3 matrix -
mx = np.matrix([[5,6,2],[3,3,7],[0,1,6],[4,5,2]])
Consider the mappings for rows and columns -
row_idx = [10, 'A', 'B','C']
col_idx = [10, 'A', 'B']
Let's take a look at the workflow with the given sample -
# Get data into dataframe with given mappings
In [57]: import pandas as pd
In [58]: df = pd.DataFrame(mx,index=row_idx, columns=col_idx)
# Here's how dataframe data looks like
In [60]: df
Out[60]:
    10  A  B
10   5  6  2
A    3  3  7
B    0  1  6
C    4  5  2
# Get one scalar element
In [61]: df.loc['C',10]
Out[61]: 4
# Get one entire col
In [63]: df.loc[:,10].values
Out[63]: array([5, 3, 0, 4])
# Get one entire row
In [65]: df.loc['A'].values
Out[65]: array([3, 3, 7])
And best of all, we are not making any extra copies, as the dataframe and its slices still index into the original matrix/array memory space -
In [98]: np.shares_memory(mx,df.loc[:,10].values)
Out[98]: True
Try this:
import numpy as np

# Build a structured array with named fields, so columns can be
# accessed by name instead of by numeric index.
A = np.array(((1, 2), (3, 4), (50, 100)))
dt = np.dtype([('ID', np.int32), ('Ring', np.int32)])
B = np.array(list(map(tuple, A)), dtype=dt)
print(B['ID'])  # [ 1  3 50]
You can use the __getitem__ and __setitem__ special methods and create a new class as shown.
Store the index map as a dictionary in an instance variable self.index_map.
import numpy as np

class Matrix(np.matrix):
    def __init__(self, lis):
        self.matrix = np.matrix(lis)
        self.index_map = {}

    def setIndexMap(self, index_map):
        self.index_map = index_map

    def getIndex(self, key):
        if type(key) is slice:
            return key
        elif key not in self.index_map.keys():
            return key
        else:
            return self.index_map[key]

    def __getitem__(self, idx):
        return self.matrix[self.getIndex(idx[0]), self.getIndex(idx[1])]

    def __setitem__(self, idx, value):
        self.matrix[self.getIndex(idx[0]), self.getIndex(idx[1])] = value
Usage:
Creating a matrix.
>>> mx = Matrix([[5,6,2],[3,3,7],[0,1,6]])
>>> mx
Matrix([[5, 6, 2],
        [3, 3, 7],
        [0, 1, 6]])
Defining the Index Map.
>>> mx.setIndexMap({10:0, 'A':1, 'B':2})
Different ways to index the matrix.
>>> mx[0,0]
5
>>> mx[10,10]
5
>>> mx[10,'A']
6
It also handles slicing as shown.
>>> mx[1:3, 1:3]
matrix([[3, 7],
        [1, 6]])
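A small follow-up sketch, not shown in the original answer: assignment goes through the same mapping via __setitem__.
>>> mx[10, 'B'] = 99   # equivalent to mx[0, 2] = 99 under the mapping
>>> mx[10, 'B']
99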

Converting dataframe columns to arrays

I'm trying to convert my dataframe columns to arrays. For example, I have a dataframe that looks like this:
Total  Price  Carrier
    2      3        C
    1      5        D
I'd like to convert the columns to arrays like this: [[2, 1], [3, 5], ['C', 'D']]. I do not want the column names.
I've tried doing this:
df["all"] = 1
df.groupby("all")[["Total","Price", "Carrier"]].apply(list)
However, I get back something like ["Total", "Price", "Carrier"], and it is an object rather than an array. How can I convert all columns to arrays?
Use df.values instead of apply:
>>> df.values.T.tolist()
[[2, 1], [3, 5], ['C', 'D']]
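A minimal self-contained sketch; the DataFrame below simply reconstructs the question's example (the column order is an assumption):
import pandas as pd

df = pd.DataFrame({'Total': [2, 1], 'Price': [3, 5], 'Carrier': ['C', 'D']})

# values gives a 2-d object array (rows x columns) because the dtypes are mixed;
# transposing and converting to nested lists yields one list per column.
cols = df.values.T.tolist()
print(cols)  # [[2, 1], [3, 5], ['C', 'D']]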
