I have a dataframe of N columns. Each element in the dataframe is in the range [0, N-1].
For example, my dataframe could be something like this (N=3):
A B C
0 0 2 0
1 1 0 1
2 2 2 0
3 2 0 0
4 0 0 0
I want to create a co-occurrence matrix (please correct me if there is a different standard name for it) of size N x N, where each element (i, j) contains the number of times columns i and j assume the same value.
A B C
A x 2 3
B 2 x 2
C 3 2 x
Where, for example, matrix[0, 1] means that A and B assume the same value 2 times.
I don't care about the value on the diagonal.
What is the smartest way to do that?
DataFrame.corr
We can define a custom callable for calculating the correlation between the columns of the dataframe. This callable takes two 1D numpy arrays as its input arguments and returns the count of the positions where the elements of the two arrays are equal to each other:
df.corr(method=lambda x, y: (x==y).sum())
A B C
A 1.0 2.0 3.0
B 2.0 1.0 2.0
C 3.0 2.0 1.0
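For reference, a self-contained run of this approach (a sketch; note that corr forces the diagonal to 1.0 when a callable is used, which is harmless here since the diagonal is ignored):
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 2, 2, 0],
                   'B': [2, 0, 2, 0, 0],
                   'C': [0, 1, 0, 0, 0]})
# The callable receives two aligned 1D arrays and returns the number of equal positions
print(df.corr(method=lambda x, y: (x == y).sum()))
#      A    B    C
# A  1.0  2.0  3.0
# B  2.0  1.0  2.0
# C  3.0  2.0  1.0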
Let's try broadcasting across the transposition and summing along axis 2:
import pandas as pd
df = pd.DataFrame({
'A': {0: 0, 1: 1, 2: 2, 3: 2, 4: 0},
'B': {0: 2, 1: 0, 2: 2, 3: 0, 4: 0},
'C': {0: 0, 1: 1, 2: 0, 3: 0, 4: 0}
})
vals = df.T.values  # shape (N, rows): each row holds one column's values
e = (vals[:, None] == vals).sum(axis=2)  # pairwise equality counts via broadcasting
e:
[[5 2 3]
[2 5 2]
[3 2 5]]
Turn back into a dataframe:
new_df = pd.DataFrame(e, columns=df.columns, index=df.columns)
new_df:
A B C
A 5 2 3
B 2 5 2
C 3 2 5
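Since the question ignores the diagonal (which here just counts the rows, because every column always matches itself), you can zero it out afterwards; a small follow-up using e and df from above:
np.fill_diagonal(e, 0)  # in-place
print(pd.DataFrame(e, columns=df.columns, index=df.columns))
#    A  B  C
# A  0  2  3
# B  2  0  2
# C  3  2  0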
I don't know about the smartest way but I think this works:
import numpy as np
m = np.array([[0, 2, 0], [1, 0, 1], [2, 2, 0], [2, 0, 0], [0, 0, 0]])
n = 3
ans = np.zeros((n, n), dtype=int)
for i in range(n):
    for j in range(i+1, n):
        # Rows where the difference is zero are rows where the values match
        ans[i, j] = len(m) - np.count_nonzero(m[:, i] - m[:, j])
print(ans + ans.T)
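The subtraction works for integer data; an equivalent, perhaps more readable sketch counts equality directly:
import numpy as np

m = np.array([[0, 2, 0], [1, 0, 1], [2, 2, 0], [2, 0, 0], [0, 0, 0]])
n = m.shape[1]
ans = np.zeros((n, n), dtype=int)
for i in range(n):
    for j in range(i + 1, n):
        # Count rows where columns i and j hold the same value
        ans[i, j] = (m[:, i] == m[:, j]).sum()
print(ans + ans.T)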
If I have the following matrix, where the input format is a list of lists:
B T E
0 1 0
0 1 1
0 2 1
1 2 0
How can I construct the following python matrix:
D = [[{1}, {1, 2}],
     [{2}, {}]]
Where the elements of D merge the pairs (B, E) with their respective T values (rows indexed by B, columns by E).
Example: the pair (0, 1) in the above matrix has T = 1 and T = 2, so in the D matrix it should be the set {1, 2}. Since there is no (1, 1) pair, that cell should be an empty set {}.
How could I do that in a "pythonic" way?
You can use collections.defaultdict:
from collections import defaultdict
m = [[0, 1, 0], [0, 1, 1], [0, 2, 1], [1, 2, 0]]
d = defaultdict(dict)
for b, t, e in m:
    # Append t to the (b, e) bucket, creating the list on first sight
    d[b][e] = [t] if e not in d[b] else [*d[b][e], t]
l = {i for b in d.values() for i in b}
result = [[set(k.get(j, [])) for j in l] for k in d.values()]
print(result)
Output:
[[{1}, {1, 2}],
[{2}, set()]]
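A slightly more direct variant (my sketch, not the original answer) accumulates sets keyed by the (b, e) pair from the start and then lays the grid out over the observed B and E values:
from collections import defaultdict

m = [[0, 1, 0], [0, 1, 1], [0, 2, 1], [1, 2, 0]]
# Accumulate the T values directly into sets keyed by (B, E)
cells = defaultdict(set)
for b, t, e in m:
    cells[(b, e)].add(t)
bs = sorted({b for b, _ in cells})
es = sorted({e for _, e in cells})
D = [[cells.get((b, e), set()) for e in es] for b in bs]
print(D)  # [[{1}, {1, 2}], [{2}, set()]]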
It's hard to guess what your input or output format is. If you are using pandas, you can do:
>>> data
[[0, 1, 0],
[0, 1, 1],
[0, 2, 1],
[1, 2, 0]]
>>> df = pd.DataFrame(data, columns=['B', 'T', 'E'])
>>> df
B T E
0 0 1 0
1 0 1 1
2 0 2 1
3 1 2 0
>>> df.groupby(['B', 'E']).agg(set).unstack('E', fill_value=set())
T
E 0 1
B
0 {1} {1, 2}
1 {2} {}
# OR,
>>> df.groupby(['B', 'E']).agg(set).unstack('E', fill_value=set()).to_numpy()
array([[{1}, {1, 2}],
[{2}, set()]], dtype=object)
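One caveat worth knowing (my note, not part of the original answer): fill_value=set() places the same set object into every missing cell, so mutating one "empty" cell would show up in all of them. If you intend to mutate cells, rebuild them afterwards:
>>> out = df.groupby(['B', 'E']).agg(set).unstack('E', fill_value=set())
>>> out = out.applymap(set)  # copy each cell; applymap is named DataFrame.map in pandas >= 2.1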
This is the dictionary I have:
docs = {'computer': {'1': 1, '3': 5, '8': 2},
'politics': {'0': 2, '1': 2, '3': 1}}
I want to create a 2 x 9 tensor like this:
[
[0, 1, 0, 5, 0, 0, 0, 0, 2],
[2, 2, 0, 1, 0, 0, 0, 0, 0]
]
Here, because the max key is 8, we have 9 columns. But the number of rows and columns can increase based on the dictionary.
I have tried to implement this with for-loops, but since the dictionary is big it is not efficient at all; also, it produces a list, and I need a tensor.
maxr = 0
for i, val in docs.items():
    for j in val.keys():
        if int(j) > maxr:
            maxr = int(j)
final_lst = []
for val in docs.values():
    lst = [0] * (maxr + 1)
    for j, val2 in sorted(val.items()):
        lst[int(j)] = val2
    final_lst.append(lst)
print(final_lst)
If you are ok with using pandas and numpy, here's how you can do it.
import pandas as pd
import numpy as np
# Creates a dataframe with keys as index and values as cell values.
df = pd.DataFrame(docs)
# Create a new index covering every key from the min to the max of the dictionary keys.
new_index = np.arange(int(df.index.min()),
                      int(df.index.max()) + 1).astype(str)
# Reindex with the new index, fill the NaN values with 0, and transpose the dataframe.
new_df = df.reindex(new_index).fillna(0).T.astype(int)
#          0  1  2  3  4  5  6  7  8
#computer  0  1  0  5  0  0  0  0  2
#politics  2  2  0  1  0  0  0  0  0
If you just want the array, you can call array = new_df.values.
#[[0 1 0 5 0 0 0 0 2]
# [2 2 0 1 0 0 0 0 0]]
If you want a tensor, you can use tf.convert_to_tensor(new_df.values).
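If you'd rather skip pandas entirely, here is a minimal NumPy sketch (my variant; it assumes the inner keys are integer strings, as in the example):
import numpy as np

docs = {'computer': {'1': 1, '3': 5, '8': 2},
        'politics': {'0': 2, '1': 2, '3': 1}}

# One row per outer key, one column per index 0..max_key
n_cols = max(int(k) for counts in docs.values() for k in counts) + 1
arr = np.zeros((len(docs), n_cols), dtype=int)
for row, counts in enumerate(docs.values()):
    for key, value in counts.items():
        arr[row, int(key)] = value
print(arr)
# [[0 1 0 5 0 0 0 0 2]
#  [2 2 0 1 0 0 0 0 0]]
Then tf.convert_to_tensor(arr) (or torch.from_numpy(arr)) turns it into a tensor.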
You are given two integers n and r, such that 1 <= r < n, and a two-dimensional array W of size n x n.
Each element of this array is either 0 or 1.
Your goal is to compute the density map D for array W, using a radius of r.
The output density map is also a two-dimensional array, where each value represents the number of 1's in matrix W within the specified radius.
Given the following input array W of size 5 and radius 1 (n = 5, r = 1):
1 0 0 0 1
1 1 1 0 0
1 0 0 0 0
0 0 0 1 1
0 1 0 0 0
Output (using Python):
3 4 2 2 1
4 5 2 2 1
3 4 3 3 2
2 2 2 2 2
1 1 2 2 2
Logic: the value in the first row, first column of the input is 1, and r is 1, so we sum the element itself together with all of its in-bounds neighbours: one element to the right, one to the left, the one above, top-left, top-right, the one below, bottom-left and bottom-right.
No 3rd party library should be used.
I did it with a for loop and an inner for loop, checking each element. Is there a better approach?
Optimization: for each 1 in W, update the count of every location in whose neighborhood it lies.
Although for a W of size n x n the following algorithm still takes O(n^2) steps, if W is sparse, i.e. the number of 1s (say k) is much smaller than n x n, then instead of the roughly r x r x n x n steps of the approach stated in the question, the following takes n x n + r x r x k steps, which is much lower when k << n x n.
Given r assigned and W stored as
[[1, 0, 0, 0, 1],
[1, 1, 1, 0, 0],
[1, 0, 0, 0, 0],
[0, 0, 0, 1, 1],
[0, 1, 0, 0, 0]]
then the following
output = [[0 for i in range(len(W[0]))] for j in range(len(W))]
for i in range(len(W)):
    for j in range(len(W[0])):
        if W[i][j] == 1:
            # Add this 1 to every cell whose neighborhood contains (i, j)
            for off_i in range(-r, r+1):
                for off_j in range(-r, r+1):
                    if (0 <= i+off_i < len(W)) and (0 <= j+off_j < len(W[0])):
                        output[i+off_i][j+off_j] += 1
stores the required values in output.
For r = 1, the output is as required:
[[3, 4, 2, 2, 1],
[4, 5, 2, 2, 1],
[3, 4, 3, 3, 2],
[2, 2, 2, 2, 2],
[1, 1, 2, 2, 2]]
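A different sketch that also avoids third-party libraries and runs in O(n^2) regardless of how many 1s W contains is a 2D prefix sum (summed-area table); this is my alternative, not the original answer's method:
def density_map(W, r):
    n = len(W)
    # P[i][j] = sum of W[0..i-1][0..j-1] (prefix sums with a zero border)
    P = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(n):
            P[i + 1][j + 1] = W[i][j] + P[i][j + 1] + P[i + 1][j] - P[i][j]
    D = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Clip the (2r+1) x (2r+1) window to the array bounds
            top, left = max(0, i - r), max(0, j - r)
            bot, right = min(n, i + r + 1), min(n, j + r + 1)
            D[i][j] = P[bot][right] - P[top][right] - P[bot][left] + P[top][left]
    return D
# density_map(W, 1) reproduces the required output for the example above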
I have a matrix similar to this:
1 0 0
1 0 0
0 2 0
0 2 0
0 0 3
0 0 3
(Non-zero numbers denote the parts that I'm interested in. The actual numbers inside the matrix could be random.)
And I need to produce a vector like this:
[ 1 1 2 2 3 3 ].T
I can do this with a loop:
result = np.zeros(rows)
for y in range(rows):
    x = y // (rows // cols)  # pick the index of the corresponding column
    result[y] = mat[y][x]
But I can't figure out how to do this in vector form.
This might be what you want.
import numpy as np
m = np.array([
[1, 0, 0],
[1, 0, 0],
[0, 2, 0],
[0, 2, 0],
[0, 0, 3],
[0, 0, 3]
])
rows, cols = m.shape
# row (axis 0) indices
y = np.arange(rows)
# column (axis 1) indices: each row maps to its block's column
x = y // (rows // cols)
result = m[y,x]
print(result)
Result:
[1 1 2 2 3 3]
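Equivalently (my variant, reusing m, rows and cols from above), the column indices can be built with np.repeat, which reads as "each column index repeated once per block of rows"; this assumes rows is an exact multiple of cols:
# [0 0 1 1 2 2]: column index for each row
x = np.repeat(np.arange(cols), rows // cols)
print(m[np.arange(rows), x])  # [1 1 2 2 3 3]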
I have the following pandas series (represented as a list):
[7,2,0,3,4,2,5,0,3,4]
I would like to define a new series that contains the distance to the last zero. That is, I would like to have the following output:
[1,2,0,1,2,3,4,0,1,2]
How to do it in pandas in the most efficient way?
The complexity is O(n). What will slow it down is doing a for loop in Python. If there are k zeros in the series, and log k is negligible compared to the length of the series, an O(n log k) solution would be:
>>> izero = np.r_[-1, (ts == 0).to_numpy().nonzero()[0]]  # indices of zeros
>>> idx = np.arange(len(ts))
>>> idx - izero[np.searchsorted(izero - 1, idx) - 1]
array([1, 2, 0, 1, 2, 3, 4, 0, 1, 2])
A solution in Pandas is a little bit tricky, but could look like this (s is your Series):
>>> x = (s != 0).cumsum()
>>> y = x != x.shift()
>>> y.groupby((y != y.shift()).cumsum()).cumsum()
0 1
1 2
2 0
3 1
4 2
5 3
6 4
7 0
8 1
9 2
dtype: int64
For the last step, this uses the "itertools.groupby" recipe from the Pandas cookbook.
A solution that may not be as performant (I haven't really checked), but that is easier to understand in terms of the steps (at least for me):
import numpy as np
import pandas as pd

df = pd.DataFrame({'X': [7, 2, 0, 3, 4, 2, 5, 0, 3, 4]})
df['flag'] = np.where(df['X'] == 0, 0, 1)  # 0 at zeros, 1 elsewhere
df['cumsum'] = df['flag'].cumsum()  # running count of non-zero entries
df['offset'] = df['cumsum']
df.loc[df.flag == 1, 'offset'] = np.nan  # keep the running count only at the zeros
df['offset'] = df['offset'].fillna(method='ffill').fillna(0).astype(int)
df['final'] = df['cumsum'] - df['offset']  # distance since the last zero
df
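Working the steps through by hand (computed from the code above, so each column can be checked), the intermediate values come out as:
   X  flag  cumsum  offset  final
0  7     1       1       0      1
1  2     1       2       0      2
2  0     0       2       2      0
3  3     1       3       2      1
4  4     1       4       2      2
5  2     1       5       2      3
6  5     1       6       2      4
7  0     0       6       6      0
8  3     1       7       6      1
9  4     1       8       6      2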
It's sometimes surprising to see how simple it is to get C-like speeds for this stuff using Cython. Assuming your column's .values gives arr (a 1D np.int32 array here, so that it matches the int[:] memoryview), then:
cdef int[:] arr_view = arr
ret = np.zeros_like(arr)
cdef int[:] ret_view = ret
cdef int i, zero_count = 0
for i in range(len(ret)):
    zero_count = 0 if arr_view[i] == 0 else zero_count + 1
    ret_view[i] = zero_count
Note the use of typed memoryviews, which are extremely fast. You can speed it up further by decorating the enclosing function with @cython.boundscheck(False).
Another option:
df = pd.DataFrame({'X': [7, 2, 0, 3, 4, 2, 5, 0, 3, 4]})
zeros = np.r_[-1, np.where(df.X == 0)[0]]
def d0(a):
    # smallest non-negative distance (the -1 sentinel covers a leading stretch with no zero)
    return np.min(a[a >= 0])
df.index.to_series().apply(lambda i: d0(i - zeros))
Or using pure numpy:
df = pd.DataFrame({'X': [7, 2, 0, 3, 4, 2, 5, 0, 3, 4]})
a = np.arange(len(df))[:, None] - np.r_[-1 , np.where(df.X == 0)[0]][None]
np.min(a, where=a>=0, axis=1, initial=len(df))
Yet another way to do this uses NumPy's ufunc accumulate. The only catch is that, to initialize the counter at zero, you need to insert a zero in front of the series values.
import numpy as np
# Define Python function
f = lambda a, b: 0 if b == 0 else a + 1
# Convert to Numpy ufunc
npf = np.frompyfunc(f, 2, 1)
# Apply recursively over series values
x = npf.accumulate(np.r_[0, s.values])[1:]
print(x)
array([1, 2, 0, 1, 2, 3, 4, 0, 1, 2], dtype=object)
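Because np.frompyfunc produces an object-dtype ufunc, the result comes back with dtype=object; cast it if you need a plain integer array:
x = x.astype(int)  # object dtype -> int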
Here is a way without using groupby:
((v:=pd.Series([7,2,0,3,4,2,5,0,3,4]).ne(0))
.cumsum()
.where(v.eq(0)).ffill().fillna(0)
.rsub(v.cumsum())
.astype(int)
.tolist())
Output:
[1, 2, 0, 1, 2, 3, 4, 0, 1, 2]
Maybe pandas is not the best tool for this, as noted in the answer by @behzad.nouri; however, here is another variation:
df = pd.DataFrame({'X': [7, 2, 0, 3, 4, 2, 5, 0, 3, 4]})
z = df.ne(0).X
z.groupby((z != z.shift()).cumsum()).cumsum()
0 1
1 2
2 0
3 1
4 2
5 3
6 4
7 0
8 1
9 2
Name: X, dtype: int64
Solution 2:
If you write the following code you will get almost everything you need, except that the first row starts from 0 and not 1:
df = pd.DataFrame({'X': [7, 2, 0, 3, 4, 2, 5, 0, 3, 4]})
df.eq(0).cumsum().groupby('X').cumcount()
0 0
1 1
2 0
3 1
4 2
5 3
6 4
7 0
8 1
9 2
dtype: int64
This happens because the cumulative count starts from 0. To get the desired result, I prepended a row containing 0, calculated everything, and then dropped the extra row at the end to get:
x = pd.Series([0], index=[0])
df = pd.concat([x, df])
df.eq(0).cumsum().groupby('X').cumcount().reset_index(drop=True).drop(0).reset_index(drop=True)
0 1
1 2
2 0
3 1
4 2
5 3
6 4
7 0
8 1
9 2
dtype: int64