Accessing pandas multi-index with a variable - python

I'm struggling to access a pandas DataFrame with a multi-index programmatically. Let's say I have
import pandas as pd
df = pd.DataFrame([[0, 0, 0, 1],
                   [0, 0, 1, 2],
                   [0, 1, 0, 7],
                   [0, 1, 1, 9],
                   [1, 0, 0, 1],
                   [1, 0, 1, 0],
                   [1, 1, 0, 1],
                   [1, 1, 1, 10]], columns=['c1', 'c2', 'c3', 'value'])
sums = df.groupby(['c1', 'c2', 'c3']).value.sum()
I can get the sum which corresponds to the [1, 1, 1] combination of c1, c2 and c3 with
sums[1, 1, 1]
That returns 10 as expected.
But what if I have a variable
q = [1, 1, 1]
how do I get the same value out?
I have tried
sums[q]
which gives
c1  c2  c3
0   0   1     2
        1     2
        1     2
Name: value, dtype: int64
I also thought the star operator could work:
sums[*q]
but that is invalid syntax.

Use Series.xs with a tuple:
print(sums.xs((1, 1, 1)))
10
Or Series.loc:
print(sums.loc[(1, 1, 1)])
# alternative
# print(sums[(1, 1, 1)])
10
For a key stored in a variable, convert the list to a tuple first:
q = [1, 1, 1]
print(sums.loc[tuple(q)])
# alternative
# print(sums[tuple(q)])
10
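The point worth remembering is that a tuple addresses one point across all index levels, while a list asks for several first-level keys, so `tuple(q)` restores the scalar lookup. A minimal sketch of this approach, reproducing the setup above:

```python
import pandas as pd

df = pd.DataFrame([[0, 0, 0, 1], [0, 0, 1, 2], [0, 1, 0, 7], [0, 1, 1, 9],
                   [1, 0, 0, 1], [1, 0, 1, 0], [1, 1, 0, 1], [1, 1, 1, 10]],
                  columns=['c1', 'c2', 'c3', 'value'])
sums = df.groupby(['c1', 'c2', 'c3']).value.sum()

q = [1, 1, 1]
# both lookups address the single point (c1=1, c2=1, c3=1)
print(sums.loc[tuple(q)])  # 10
print(sums.xs(tuple(q)))   # 10
```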

Related

With three lists, two of which are array coordinates, how do I create an array in python?

I have three lists (really columns in a pandas dataframe): one with the data of interest, one with x array coordinates, and one with y array coordinates. All lists are the same length, and their order associates each data value with its coordinates (so L1: "Apple" coincides with L2: "1" and L3: "A"). I would like to make an array with the dimensions given by the two coordinate lists, filled with the data from the data list. What is the best way to do this?
The expected output would be in the form of a numpy array or something like:
array = [[0, 0, 0, 3, 0, 0, 2, 3], [0, 0, 0, 0, 0, 0, 0, 3]]  # data based on below
Where in this example the array has the dimensions of y = 2 from y.unique() and x = 8 from x.unique().
The following is example input data for what I am talking about:
array_x  array_y  Data
1        a        0
2        a        0
3        a        0
4        a        3
5        a        0
6        a        0
7        a        2
8        a        3
1        b        0
2        b        0
3        b        0
4        b        0
5        b        0
6        b        0
7        b        0
8        b        3
You may be looking for pivot:
out = df.pivot(values=['Data'], columns=['array_y'], index=['array_x']).to_numpy()
Output:
array([[0, 0],
       [0, 0],
       [0, 0],
       [3, 0],
       [0, 0],
       [0, 0],
       [2, 0],
       [3, 3]], dtype=int64)
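Note that this pivot puts array_x on the rows, so the result is the transpose of the expected 2×8 output. A sketch that swaps index and columns to match, building the example table directly:

```python
import pandas as pd

# the example table from the question
df = pd.DataFrame({
    'array_x': [1, 2, 3, 4, 5, 6, 7, 8] * 2,
    'array_y': ['a'] * 8 + ['b'] * 8,
    'Data':    [0, 0, 0, 3, 0, 0, 2, 3, 0, 0, 0, 0, 0, 0, 0, 3],
})

# one row per unique array_y value, one column per unique array_x value
out = df.pivot(values='Data', columns='array_x', index='array_y').to_numpy()
print(out.shape)  # (2, 8)
```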
Supposing you have a dataframe like this:
import pandas as pd
import numpy as np
myDataframe = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['x', 'y'])
Then you can select the columns you want and create an array from it:
my_array = np.array(myDataframe[['x', 'y']])
>>> my_array
array([[1, 2],
       [3, 4],
       [5, 6]], dtype=int64)
You could do a zip (note: I'm shorthand-ing some of your example data):
data_x = [1, 2, 3, 4, 5, 6, 7, 8] * 2
data_y = ['a'] * 8 + ['b'] * 8
data_vals = [0, 0, 0, 3, 0, 0, 2, 3, 0, 0, 0, 0, 0, 0, 0, 3]

coll = dict()
for (x, y, val) in zip(data_x, data_y, data_vals):
    if coll.get(y) is None:
        coll[y] = []
    if x > len(coll[y]):
        coll[y].extend([0] * (x - len(coll[y])))
    coll[y][x - 1] = val

result = []
for k in sorted(coll):
    result.append(coll[k])

print(coll)
print(result)
Output:
{'a': [0, 0, 0, 3, 0, 0, 2, 3], 'b': [0, 0, 0, 0, 0, 0, 0, 3]}
[[0, 0, 0, 3, 0, 0, 2, 3], [0, 0, 0, 0, 0, 0, 0, 3]]

How do I construct an incidence matrix from two dataframe columns using scipy.sparse.coo_matrix((data, (i, j)))?

I have a pandas DataFrame containing two columns ['A', 'B']. Each column is made up of integers.
I want to construct a sparse matrix with the following properties:
row index is all integers from 0 to the max value in the dataframe
column index is the same as row index
entry i,j = 1 if [i,j] or [j,i] is a row of my dataframe (1 should be the max value of the matrix).
Most importantly, I want to do this using
coo_matrix((data, (i, j)))
from scipy.sparse as I'm trying to understand this constructor and this particular way of using it. I have never worked with sparse matrices before. I've tried a few things but none of them is working.
EDIT
Sample code
Defining the dataframe
In [96]: df = pd.DataFrame(np.random.randint(5, size=(10,2)))
In [97]: df.columns = ['a', 'b']
In [98]: df
Out[98]:
   a  b
0  0  3
1  1  4
2  3  3
3  2  0
4  0  2
5  1  0
6  1  1
7  2  3
8  3  4
9  3  2
The closest I've come to a solution
In [100]: scipy.sparse.coo_matrix((np.ones_like(df['a']), (df['a'].array, df['b'].array))).toarray()
Out[100]:
array([[0, 0, 1, 1, 0],
       [1, 1, 0, 0, 1],
       [1, 0, 0, 1, 0],
       [0, 0, 1, 1, 1]])
The problem is this isn't a symmetric matrix (as it doesn't add to both i,j and j,i for a given row) and I think it would give values greater than 1 if there were duplicate rows.
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix

df = pd.DataFrame(np.random.default_rng(seed=100).integers(5, size=(10, 2)))
df.columns = ['a', 'b']
arr = coo_matrix((np.ones_like(df.a), (df.a.values, df.b.values)))
This is what you've got: entry (i, j) >= 1 if [i, j] is a row of df.
arr = arr + arr.T
array([[0, 1, 2, 2, 0],
       [1, 0, 0, 0, 0],
       [2, 0, 0, 1, 2],
       [2, 0, 1, 0, 1],
       [0, 0, 2, 1, 2]])
Now entry (i, j) >= 1 if [i, j] or [j, i] is a row of df.
arr.data = np.ones_like(arr.data)
Now entry (i, j) = 1 if [i, j] or [j, i] is a row of df.
array([[0, 1, 1, 1, 0],
       [1, 0, 0, 0, 0],
       [1, 0, 0, 1, 1],
       [1, 0, 1, 0, 1],
       [0, 0, 1, 1, 1]])
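Putting those steps together, a self-contained sketch (using a small illustrative dataframe of my own, and an explicit shape so the matrix is square even when some indices never occur; setting the stored data to ones also caps entries at 1 if the dataframe contains duplicate rows):

```python
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix

# illustrative data: rows are the pairs (0,3), (1,4), (3,3), (2,0)
df = pd.DataFrame({'a': [0, 1, 3, 2], 'b': [3, 4, 3, 0]})

n = int(df[['a', 'b']].to_numpy().max()) + 1           # square shape covering 0..max
arr = coo_matrix((np.ones(len(df)),
                  (df['a'].to_numpy(), df['b'].to_numpy())), shape=(n, n))
arr = arr + arr.T                                      # symmetrise: count [i,j] and [j,i]
arr.data = np.ones_like(arr.data)                      # cap every stored entry at 1
dense = arr.toarray()
print(dense)
```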

How can I find the value with the minimum MSE with a numpy array?

My possible values are:
0: [0 0 0 0]
1: [1 0 0 0]
2: [1 1 0 0]
3: [1 1 1 0]
4: [1 1 1 1]
I have some values:
[[0.9539342 0.84090066 0.46451256 0.09715253],
[0.9923432 0.01231235 0.19491441 0.09715253]
....
I want to figure out which of my possible values each of my new values is closest to. Ideally I want to avoid a for loop and wonder if there's some sort of vectorized way to search for the minimum mean squared error?
I want it to return an array that looks like: [2, 1 ....
You can use np.argmin to get the index of the lowest error, which can be computed with np.linalg.norm (the Euclidean norm is a monotonic function of the MSE, so it yields the same argmin):
import numpy as np
a = np.array([[0, 0, 0, 0], [1, 0, 0, 0], [1, 1, 0, 0],[1, 1, 1, 0], [1, 1, 1, 1]])
b = np.array([0.9539342, 0.84090066, 0.46451256, 0.09715253])
np.argmin(np.linalg.norm(a-b, axis=1))
#outputs 2 which corresponds to the value [1, 1, 0, 0]
As mentioned in the edit, b can have multiple rows. The OP wants to avoid a for loop, but I can't seem to find a way to avoid it entirely. Here is a list-comprehension way, but there could be a better one:
[np.argmin(np.linalg.norm(a-i, axis=1)) for i in b]
#Outputs [2, 1]
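For what it's worth, the loop can be removed with broadcasting: compare every row of b against every candidate at once and reduce over the last axis. A sketch (memory scales with len(b) * len(a), which is fine at this size):

```python
import numpy as np

a = np.array([[0, 0, 0, 0], [1, 0, 0, 0], [1, 1, 0, 0],
              [1, 1, 1, 0], [1, 1, 1, 1]])
b = np.array([[0.9539342, 0.84090066, 0.46451256, 0.09715253],
              [0.9923432, 0.01231235, 0.19491441, 0.09715253]])

# (len(b), 1, 4) - (1, len(a), 4) broadcasts to (len(b), len(a), 4)
dist = np.linalg.norm(b[:, None, :] - a[None, :, :], axis=2)
idx = np.argmin(dist, axis=1)
print(idx)  # [2 1]
```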
Let's assume your input data is a dictionary. You can then use NumPy for a vectorized solution: first convert the input lists to a NumPy array, then use the axis=1 argument to get the RMSE.
# Input data
dicts = {0: [0, 0, 0, 0], 1: [1, 0, 0, 0], 2: [1, 1, 0, 0], 3: [1, 1, 1, 0],4: [1, 1, 1, 1]}
new_value = np.array([0.9539342, 0.84090066, 0.46451256, 0.09715253])
# Convert values to array
values = np.array(list(dicts.values()))
# Compute the RMSE and get the index for the least RMSE
rmse = np.mean((values-new_value)**2, axis=1)**0.5
index = np.argmin(rmse)
print ("The closest value is %s" %(values[index]))
# The closest value is [1 1 0 0]
Pure numpy:
val1 = np.array([
    [0, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 1]
])
print(val1)
val2 = np.array([0.9539342, 0.84090066, 0.46451256, 0.09715253], float)
val3 = np.round(val2, 0)
print(val3)
print(np.where((val1 == val3).all(axis=1)))  # shows a match on row 2: (array([2]),)

pandas groupby and update the sum of the number of times the values in one column are greater than the other column

I have a dataset in the following format
df = pd.DataFrame([[1, 'Label1', 0, 8, 2], [1, 'Label3', 0, 20, 5], [2, 'Label5', 1, 20, 2], [2, 'Label4', 1, 11, 0],
                   [5, 'Label2', 0, 0, -4], [1, 'Label2', 1, 8, 2], [2, 'Label5', 0, 20, 5], [3, 'Label2', 1, 20, 2], [4, 'Label4', 0, 1, 0],
                   [5, 'Label3', 0, 1, -4], [1, 'Label3', 1, 8, 2], [2, 'Label4', 0, 20, 5], [3, 'Label1', 1, 20, 2], [4, 'Label3', 0, 1, 0],
                   [5, 'Label4', 0, 1, -4], [1, 'Label4', 1, 8, 2], [2, 'Label3', 0, 20, 5], [3, 'Label3', 1, 20, 2], [4, 'Label5', 0, 1, 0],
                   [5, 'Label5', 0, 1, -4]],
                  columns=['ID', 'Label', 'Status', 'Coeff', 'result'])
cm = {'TP': 0, 'FP': 0}
For each ID in df, I would like to find the number of times the column Coeff is greater than result when the Status column is 1. If this count is greater than 3 then TP should be incremented by 1, and if it is less than 3 then FP should be incremented by 1.
Example: when the ID is 1111 and Status is 1, if the Coeff column is greater than the result column twice for that particular ID, then FP must be incremented by 1.
I tried to add a new column called count for each ID and assigned the value 1 every time the column Coeff was greater than result:
for ID in df.groupby('ID'):
    df.loc[(df['Coeff'] > df['result']), 'count'] = 1
df_new = list(df[['ID', 'count']].groupby(df['ID']))
Then I thought of finding whether count has the number 1 in it. If it does, then increment TP. Otherwise, increment FP.
But I couldn't achieve it.
How do I get the required result?
A simple grouping operation on a masked comparison should do:
v = df.Coeff.gt(df.result).where(df.Status.astype(bool)).groupby(df.ID).sum()
Or (to retain dtype=int, thanks piR!),
v = df.Coeff.gt(df.result).where(df.Status.astype(bool), 0).groupby(df.ID).sum()
v # second expression result
ID
1 3
2 2
3 3
4 0
5 0
dtype: int64
Now,
cm['TP'] = v.gt(3).sum()
cm['FP'] = v.lt(3).sum()
Details
df.Coeff.gt(df.result) returns a mask. Now, hide all those values for which df.Status is not 1. This is done using (df.Coeff > df.result).where(df.Status.astype(bool)). Finally, take this masked result, and group on ID, followed by a sum to get your result.
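End to end, the bookkeeping looks like this; a sketch using the question's dataframe, with a plain boolean AND in place of the masked where (it gives the same per-ID counts):

```python
import pandas as pd

df = pd.DataFrame([[1, 'Label1', 0, 8, 2], [1, 'Label3', 0, 20, 5],
                   [2, 'Label5', 1, 20, 2], [2, 'Label4', 1, 11, 0],
                   [5, 'Label2', 0, 0, -4], [1, 'Label2', 1, 8, 2],
                   [2, 'Label5', 0, 20, 5], [3, 'Label2', 1, 20, 2],
                   [4, 'Label4', 0, 1, 0], [5, 'Label3', 0, 1, -4],
                   [1, 'Label3', 1, 8, 2], [2, 'Label4', 0, 20, 5],
                   [3, 'Label1', 1, 20, 2], [4, 'Label3', 0, 1, 0],
                   [5, 'Label4', 0, 1, -4], [1, 'Label4', 1, 8, 2],
                   [2, 'Label3', 0, 20, 5], [3, 'Label3', 1, 20, 2],
                   [4, 'Label5', 0, 1, 0], [5, 'Label5', 0, 1, -4]],
                  columns=['ID', 'Label', 'Status', 'Coeff', 'result'])

# per-ID count of rows where Status == 1 and Coeff > result
v = (df.Coeff.gt(df.result) & df.Status.eq(1)).groupby(df.ID).sum()
cm = {'TP': int(v.gt(3).sum()), 'FP': int(v.lt(3).sum())}
print(cm)  # counts equal to 3 fall in neither bucket under the stated rule
```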

Counting combinations over pairs of columns in a numpy array

I have a matrix with a certain number of columns containing only the numbers 0 and 1, and I want to count the number of [0, 0], [0, 1], [1, 0], and [1, 1] pairs in each PAIR of columns.
So for example, if I have a matrix with four columns, I want to count the number of 00s, 01s, 10s, and 11s in the first and second columns, append the result to a list, then loop over the 3rd and 4th columns and append that answer to the list.
Example input:
array([[0, 1, 1, 0],
       [1, 0, 1, 0],
       [0, 1, 0, 1],
       [0, 0, 1, 1],
       [1, 1, 0, 0]])
My expected output is:
array([[1, 1],
       [2, 1],
       [1, 2],
       [1, 1]])
Explanation:
The first two columns have [0, 0] once. The second two columns also have [0, 0] once. The first two columns have [0, 1] twice, and the second two columns have [0, 1] once... and so on.
This is my latest attempt and it seems to work. Would like feedback.
# for each pair of columns calculate haplotype frequencies
# haplotypes:
# h1 = 11
# h2 = 10
# h3 = 01
# h4 = 00
# takes as input a pair of columns
def calc_haplotype_freq(matrix):
    h1_frequencies = []
    h2_frequencies = []
    h3_frequencies = []
    h4_frequencies = []
    colIndex1 = 0
    colIndex2 = 1
    for i in range(0, 2):  # number of columns divided by 2
        h1 = 0
        h2 = 0
        h3 = 0
        h4 = 0
        column_1 = matrix[:, colIndex1]
        column_2 = matrix[:, colIndex2]
        for row in range(0, matrix.shape[0]):
            if (column_1[row, 0] == 1).any() & (column_2[row, 0] == 1).any():
                h1 += 1
            elif (column_1[row, 0] == 1).any() & (column_2[row, 0] == 0).any():
                h2 += 1
            elif (column_1[row, 0] == 0).any() & (column_2[row, 0] == 1).any():
                h3 += 1
            elif (column_1[row, 0] == 0).any() & (column_2[row, 0] == 0).any():
                h4 += 1
        colIndex1 += 2
        colIndex2 += 2
        h1_frequencies.append(h1)
        h2_frequencies.append(h2)
        h3_frequencies.append(h3)
        h4_frequencies.append(h4)
    print("H1 Frequencies (11): ", h1_frequencies)
    print("H2 Frequencies (10): ", h2_frequencies)
    print("H3 Frequencies (01): ", h3_frequencies)
    print("H4 Frequencies (00): ", h4_frequencies)
For the sample input above, this gives:
----------
H1 Frequencies (11): [1, 1]
H2 Frequencies (10): [1, 2]
H3 Frequencies (01): [2, 1]
H4 Frequencies (00): [1, 1]
----------
Which is correct, but is there a better way to do this? How can I return these results from the function for further processing?
Starting with this -
x
array([[0, 1, 1, 0],
       [1, 0, 1, 0],
       [0, 1, 0, 1],
       [0, 0, 1, 1],
       [1, 1, 0, 0]])
Split your array into groups of 2 columns and concatenate them:
y = x.T
z = np.concatenate([y[i:i + 2] for i in range(0, y.shape[0], 2)], 1).T
Now, perform a broadcasted comparison and sum:
(z[:, None] == [[0, 0], [0, 1], [1, 0], [1, 1]]).all(2).sum(0)
array([2, 3, 3, 2])
If you want a per-column pair count, then you could do something like this:
def calc_haplotype_freq(x):
    counts = []
    for i in range(0, x.shape[1], 2):
        counts.append(
            (x[:, None, i:i + 2] == [[0, 0], [0, 1], [1, 0], [1, 1]]).all(2).sum(0)
        )
    return np.column_stack(counts)
calc_haplotype_freq(x)
array([[1, 1],
       [2, 1],
       [1, 2],
       [1, 1]])
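As a design note, since each pair is two bits, an alternative sketch is to encode every pair as a code 0..3 (2 * left + right) and count codes with np.bincount, avoiding the four-way broadcast entirely:

```python
import numpy as np

x = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 1],
              [1, 1, 0, 0]])

# encode each column pair as 0..3 (00 -> 0, 01 -> 1, 10 -> 2, 11 -> 3)
codes = 2 * x[:, 0::2] + x[:, 1::2]
counts = np.column_stack([np.bincount(codes[:, j], minlength=4)
                          for j in range(codes.shape[1])])
print(counts)  # rows are the counts for 00, 01, 10, 11
```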
