I have a covariance matrix (as a pandas DataFrame) in python as follows:
   a   b    c
a  1   2    3
b  2  10    4
c  3   4  100
And I want to dynamically select only a subset of the covariance matrix. For example, a subset with 'a' and 'c' would look like:
   a    c
a  1    3
c  3  100
Is there any function that can select this subset?
Thank you!
If your covariance matrix is a numpy array like this:
cov = np.array([[1,  2,   3],
                [2, 10,   4],
                [3,  4, 100]])
Then you can get the desired submatrix by advanced indexing:
subset = [0, 2] # a, c
cov[np.ix_(subset, subset)]
# array([[  1,   3],
#        [  3, 100]])
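If the positions themselves have to be derived dynamically from labels, a small lookup works first (a sketch, assuming the axes of cov are ordered a, b, c as above):

names = ['a', 'b', 'c']                        # assumed label order of cov's axes
subset = [names.index(n) for n in ['a', 'c']]  # -> [0, 2]
cov[np.ix_(subset, subset)]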
Edit:
If your covariance matrix is a pandas DataFrame (e.g. obtained as cov = df.cov() for some DataFrame df with columns 'a', 'b', 'c', ...), you can get the subset of 'a' and 'c' as follows:
cov.loc[['a','c'], ['a','c']]
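And since the question asks for a dynamic selection, the label list can just as well be a variable:

keep = ['a', 'c']
cov.loc[keep, keep]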
Related
I want to manipulate categorical data using a pandas DataFrame and then convert it to a numpy array for model training.
Say I have the following data frame in pandas.
import pandas as pd
df2 = pd.DataFrame({"c1": ['a','b',None], "c2": ['d','e','f']})
>>> df2
     c1 c2
0     a  d
1     b  e
2  None  f
And now I want to "compress the categories" horizontally, as in the following:
  compressed_categories
0  c1-a, c2-d    <--- this could be a string, e.g. "c1-a, c2-d", an array ["c1-a", "c2-d"], or categorical data
1  c1-b, c2-e
2  c1-nan, c2-f
Next I want to generate a dictionary/vocabulary based on the unique occurrences plus a "nan" entry per column in compressed_categories, e.g.:
volcab = {
    "c1-a": 0,
    "c1-b": 1,
    "c1-c": 2,
    "c1-nan": 3,
    "c2-d": 4,
    "c2-e": 5,
    "c2-f": 6,
    "c2-nan": 7,
}
So I can further numerically encode them as follows:
compressed_categories_numeric
0 [0, 4]
1 [1, 5]
2 [3, 6]
So my ultimate goal is to make it easy to convert each row to a numpy array, so that I can further convert it to a tensor.
input_data = np.asarray(df['compressed_categories_numeric'].tolist())
then I can train my model using input_data.
Can anyone please show me an example how to make this series of conversion? Thanks in advance!
To build the volcab dictionary and compressed_categories_numeric, you can use:

import numpy as np

# Prefix each value with its column name; None becomes NaN, then the string 'nan'
df3 = df2.fillna(np.nan).astype(str).apply(lambda x: x.name + '-' + x)

# Map each unique label to an integer code
volcab = {k: v for v, k in enumerate(np.unique(df3))}

# Encode every cell, then gather each row into a list
df2['compressed_categories_numeric'] = df3.replace(volcab).agg(list, axis=1)
Output:
>>> volcab
{'c1-a': 0, 'c1-b': 1, 'c1-nan': 2, 'c2-d': 3, 'c2-e': 4, 'c2-f': 5}
>>> df2
     c1 c2 compressed_categories_numeric
0     a  d                        [0, 3]
1     b  e                        [1, 4]
2  None  f                        [2, 5]

>>> np.array(df2['compressed_categories_numeric'].tolist())
array([[0, 3],
       [1, 4],
       [2, 5]])
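If the final step really is a tensor, the array converts directly; a minimal sketch, assuming PyTorch is the training framework:

import torch  # assumption: PyTorch; any framework with a from-numpy path works

input_data = np.array(df2['compressed_categories_numeric'].tolist())
input_tensor = torch.from_numpy(input_data)  # shape (3, 2), shares memory with input_data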
I am stuck on an issue with a massive pandas table. I would like to get an indicator for where 2 series cross.
df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [10, 1, 2, 8]})
I would like to add one column to my DataFrame to get a result like this one:
df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [10, 1, 2, 8],
                   'C': [0, -1, 0, 1]})
So basically to get:
0 when there is no cross between series B and series A
-1 when series B crosses down through series A
1 when series B crosses up through series A
I need a vectorized calculation because my real table has more than one million rows.
Thank you
You can compute the relative position of the 2 columns with lt, then convert to integer and compute the diff:
# True where A < B, i.e. where B sits above A
m = df['A'].lt(df['B'])
# a change in the relative position marks a cross: +1 up, -1 down
df['C'] = m.astype(int).diff().fillna(0, downcast='infer')
Output:
   A   B  C
0  1  10  0
1  2   1 -1
2  3   2  0
3  4   8  1
Visual of A/B: [line plot of the two columns omitted]
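If this is needed for many column pairs, the two lines fold into a small helper (a sketch; cross_signal is a made-up name):

def cross_signal(a, b):
    # 1 where b crosses up through a, -1 where it crosses down, 0 elsewhere
    above = a.lt(b).astype(int)  # 1 on rows where a < b
    return above.diff().fillna(0).astype(int)

df['C'] = cross_signal(df['A'], df['B'])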
I have the following Pandas dataframe in Python:
import pandas as pd
d = {'col1': [1, 2, 3, 4, 5], 'col2': [6, 7, 8, 9, 10]}
df = pd.DataFrame(data=d)
df.index=['A', 'B', 'C', 'D', 'E']
df
which gives the following output:
   col1  col2
A     1     6
B     2     7
C     3     8
D     4     9
E     5    10
I need to write a function (say, getNrRows(fromIndex)) that takes an index value as input and returns the number of rows between that given index and the last index of the dataframe.
For instance:
nrRows = getNrRows("C")
print(nrRows)
> 2
Because it takes 2 steps (rows) from the index C to the index E.
How can I write such a function in the most elegant way?
The simplest way might be
len(df[row_index:]) - 1
For your information, there is also the built-in method get_indexer_for:
len(df)-df.index.get_indexer_for(['C'])-1
Out[179]: array([2], dtype=int64)
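Either one-liner wraps directly into the function the question asks for; a sketch using Index.get_loc, which returns the position of a single label:

def getNrRows(fromIndex):
    # steps from fromIndex to the last row of df
    return len(df) - df.index.get_loc(fromIndex) - 1

nrRows = getNrRows("C")
print(nrRows)
# 2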
The problem is very similar to that in How to evaluate the sum of values within array blocks, where I need to sum up elements in a matrix by blocks. What is different here is that the blocks could have different sizes. For example, given a 4-by-5 matrix
1 1 | 1 1 1
----|------
1 1 | 1 1 1
1 1 | 1 1 1
1 1 | 1 1 1
and block sizes 1, 3 along the rows and 2, 3 along the columns, the result should be a 2-by-2 matrix:
2 3
6 9
Is there a way of doing this without loops?
Seems like a good fit to use np.add.reduceat to basically sum along rows and then along cols -
def sum_blocks(a, row_sizes, col_sizes):
    # Sum rows based on row-sizes
    s1 = np.add.reduceat(a, np.r_[0, row_sizes[:-1].cumsum()], axis=0)
    # Sum cols from row-summed output based on col-sizes
    return np.add.reduceat(s1, np.r_[0, col_sizes[:-1].cumsum()], axis=1)
Sample run -
In [45]: np.random.seed(0)
...: a = np.random.randint(0,9,(4,5))
In [46]: a
Out[46]:
array([[5, 0, 3, 3, 7],
       [3, 5, 2, 4, 7],
       [6, 8, 8, 1, 6],
       [7, 7, 8, 1, 5]])
In [47]: row_sizes = np.array([1,3])
...: col_sizes = np.array([2,3])
In [48]: sum_blocks(a, row_sizes, col_sizes)
Out[48]:
array([[ 5, 13],
       [36, 42]])
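As a check against the example in the question, an all-ones 4-by-5 matrix with the stated block sizes reproduces the expected 2-by-2 result:

In [49]: sum_blocks(np.ones((4, 5), dtype=int), row_sizes, col_sizes)
Out[49]:
array([[2, 3],
       [6, 9]])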
I read about the pandas MultiIndex, but the examples I found did not cover this case:
I have a set of measurements i in 1..n. Each measurement i consists of attributes a, b, c, X, Y, Z. While a, b, c have scalar values, X, Y, Z are arrays.
(X, Y, Z have different lengths for different measurements, but within one measurement i, the arrays X, Y, Z have the same length m.)
Question: What's the best way to represent this in a pandas DataFrame?
The multi-index examples I saw would index the data e.g. first level by i for the measurement and second level by k for the index into X, Y, Z. But what about the attributes a, b, c that have just one value per measurement, not m? Should the a, b, c values be repeated? Or should only the first row of each measurement i contain values for a, b, c, with the other 2..m rows containing NaN?
As alluded to in the comment, much of this will depend on the sort of questions you want to be able to answer at the end of the day. One example of a potentially meaningful representation, which takes advantage of the fact that the arrays have the same length within each row, is to expand the arrays element-wise, repeating the i-th row m_i times:
In [12]: from itertools import chain
In [13]: df = pd.DataFrame({'a': [5, 7, 1], 'b': [1, 3, 2], 'c': [10, 1, 1],
    ...:                    'X': [[10, 10], [10, 20, 30], [4, 5]],
    ...:                    'Y': [[20, 30], [1, 4, 5], [0, 1]],
    ...:                    'Z': [[0, 3], [0, 1, 2], [50, 60]]})
In [14]: df
Out[14]:
              X          Y          Z  a  b   c
0      [10, 10]   [20, 30]     [0, 3]  5  1  10
1  [10, 20, 30]  [1, 4, 5]  [0, 1, 2]  7  3   1
2        [4, 5]     [0, 1]   [50, 60]  1  2   1
In [15]: pd.DataFrame({
...: 'a': np.repeat(df.a.values, df.X.apply(len)),
...: 'b': np.repeat(df.b.values, df.X.apply(len)),
...: 'c': np.repeat(df.c.values, df.X.apply(len)),
...: 'X': list(chain.from_iterable(df.X)),
...: 'Y': list(chain.from_iterable(df.Y)),
...: 'Z': list(chain.from_iterable(df.Z))})
...:
Out[15]:
    X   Y   Z  a  b   c
0  10  20   0  5  1  10
1  10  30   3  5  1  10
2  10   1   0  7  3   1
3  20   4   1  7  3   1
4  30   5   2  7  3   1
5   4   0  50  1  2   1
6   5   1  60  1  2   1
This assumes that the fact that the array lengths match across columns means that the arrays themselves are comparable element-wise.
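From there, the (i, k) multi-index discussed in the question can be attached; a sketch, assuming the long frame from Out[15] is bound to a name out:

lengths = df.X.apply(len).values
out.index = pd.MultiIndex.from_arrays(
    [np.repeat(df.index.values, lengths),               # i: measurement id
     np.concatenate([np.arange(n) for n in lengths])],  # k: position within X, Y, Z
    names=['i', 'k'])

Repeating a, b, c on every row trades some redundancy for rows that are self-describing and group cleanly with out.groupby(level='i').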