I read about the pandas MultiIndex, but the examples I found did not cover this case:
I have a set of measurements i in 1..n. Each measurement i consists of attributes a, b, c, X, Y, Z. While a, b, c have scalar values, X, Y, Z are arrays.
(X, Y, Z have different lengths for different measurements, but within one measurement i, the arrays X, Y, Z all have the same length m.)
Question: What's the best way to represent this in a pandas DataFrame?
The MultiIndex examples I saw would index the data e.g. first level by i for the measurement and second level by k for the index into X, Y, Z. But what about the attributes a, b, c that have just one value per measurement, not m? Should the a, b, c values be repeated? Or should only the first row of each measurement i contain values for a, b, c, with the other rows 2..m containing NaN?
As alluded to in the comment, much of this will depend on the sort of questions you want to be able to answer at the end of the day. One example of a potentially meaningful representation, which takes advantage of the fact that the arrays within each row have the same length, would be to take the product along the arrays, repeating the $i$th row $m_i$ times:
In [11]: import numpy as np; import pandas as pd
In [12]: from itertools import chain
In [13]: df = pd.DataFrame({'a': [5, 7, 1], 'b': [1, 3, 2], 'c': [10, 1, 1],
    ...:                    'X': [[10, 10], [10, 20, 30], [4, 5]],
    ...:                    'Y': [[20, 30], [1, 4, 5], [0, 1]],
    ...:                    'Z': [[0, 3], [0, 1, 2], [50, 60]]})
In [14]: df
Out[14]:
              X          Y          Z  a  b   c
0      [10, 10]   [20, 30]     [0, 3]  5  1  10
1  [10, 20, 30]  [1, 4, 5]  [0, 1, 2]  7  3   1
2        [4, 5]     [0, 1]   [50, 60]  1  2   1
In [15]: pd.DataFrame({
    ...:     'a': np.repeat(df.a.values, df.X.apply(len)),
    ...:     'b': np.repeat(df.b.values, df.X.apply(len)),
    ...:     'c': np.repeat(df.c.values, df.X.apply(len)),
    ...:     'X': list(chain.from_iterable(df.X)),
    ...:     'Y': list(chain.from_iterable(df.Y)),
    ...:     'Z': list(chain.from_iterable(df.Z))})
Out[15]:
    X   Y   Z  a  b   c
0  10  20   0  5  1  10
1  10  30   3  5  1  10
2  10   1   0  7  3   1
3  20   4   1  7  3   1
4  30   5   2  7  3   1
5   4   0  50  1  2   1
6   5   1  60  1  2   1
This assumes that the fact that the array lengths match across columns means that the arrays themselves are comparable element-wise.
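If you do want the two-level (i, k) representation asked about in the question, the same flattening can feed a MultiIndex, with the scalar attributes simply repeated. A sketch along those lines, assuming the example frame above (i is the measurement number, k the position within its arrays):

```python
import numpy as np
import pandas as pd
from itertools import chain

df = pd.DataFrame({'a': [5, 7, 1], 'b': [1, 3, 2], 'c': [10, 1, 1],
                   'X': [[10, 10], [10, 20, 30], [4, 5]],
                   'Y': [[20, 30], [1, 4, 5], [0, 1]],
                   'Z': [[0, 3], [0, 1, 2], [50, 60]]})
m = df.X.apply(len)  # per-measurement array length m_i
flat = pd.DataFrame({
    'a': np.repeat(df.a.values, m),
    'b': np.repeat(df.b.values, m),
    'c': np.repeat(df.c.values, m),
    'X': list(chain.from_iterable(df.X)),
    'Y': list(chain.from_iterable(df.Y)),
    'Z': list(chain.from_iterable(df.Z))})
# two-level index: i = measurement, k = position within the arrays
flat.index = pd.MultiIndex.from_arrays(
    [np.repeat(df.index.values, m),
     list(chain.from_iterable(range(length) for length in m))],
    names=['i', 'k'])
print(flat.loc[1])  # all rows of measurement 1; a, b, c simply repeat
```

Repeating a, b, c is usually preferable to leaving NaNs after the first row, since selection and groupby then work on every row without forward-filling.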
I would like to combine all row values into a list, whenever a non-null string is found in another column.
For example if I have this pandas dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame({'X': [1, 2, 3, 4, 5, 6, 7, 8],
                   'Y': [10, 20, 30, 40, 50, 60, 70, 80],
                   'Z': [np.nan, np.nan, "A", np.nan, "A", "B", np.nan, np.nan]})
   X   Y    Z
0  1  10  NaN
1  2  20  NaN
2  3  30    A
3  4  40  NaN
4  5  50    A
5  6  60    B
6  7  70  NaN
7  8  80  NaN
I would like to combine all previous row values from columns X and Y into lists, whenever column Z has a non-null string, like this:
df = pd.DataFrame({'X': [[1, 2, 3], [4, 5], [6]],
                   'Y': [[10, 20, 30], [40, 50], [60]],
                   'Z': ["A", "A", "B"]})
           X             Y  Z
0  [1, 2, 3]  [10, 20, 30]  A
1     [4, 5]      [40, 50]  A
2        [6]          [60]  B
So I managed to "solve" it by using for loops. I would hope there is a better way to do it with pandas, but I can't seem to find it.
My for loop solution:
Get "Z" ids without NaNs:
z_idx_withoutNaN = df[~df["Z"].isnull() == True].index.tolist()
[2, 4, 5]
Loop over ids and create lists with "X" and "Y" values:
x_list = []
y_list = []
for i, index in enumerate(z_idx_withoutNaN):
    if i == 0:
        x_list = [df.iloc[:index+1]["X"].values.tolist()]
        y_list = [df.iloc[:index+1]["Y"].values.tolist()]
    else:
        x_list.append(df.iloc[previous_index:index+1]["X"].values.tolist())
        y_list.append(df.iloc[previous_index:index+1]["Y"].values.tolist())
    previous_index = index + 1
Finally, create df:
pd.DataFrame({"X": x_list,
              "Y": y_list,
              "Z": df[~df["Z"].isnull()]["Z"].values.tolist()})
           X             Y  Z
0  [1, 2, 3]  [10, 20, 30]  A
1     [4, 5]      [40, 50]  A
2        [6]          [60]  B
Let us do
out = (df.groupby(df['Z'].iloc[::-1].notna().cumsum())
         .agg({'X': list, 'Y': list, 'Z': 'first'})
         .dropna()
         .sort_index(ascending=False))
Out[23]:
           X             Y  Z
Z
3  [1, 2, 3]  [10, 20, 30]  A
2     [4, 5]      [40, 50]  A
1        [6]          [60]  B
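The reversed cumulative sum is the clever bit: counting non-null Z values from the bottom up gives every row a group label that changes right after each non-null Z, so the rows leading up to an "A" or "B" land in the same group. A small sketch of the intermediate key, using the same df:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'X': [1, 2, 3, 4, 5, 6, 7, 8],
                   'Y': [10, 20, 30, 40, 50, 60, 70, 80],
                   'Z': [np.nan, np.nan, "A", np.nan, "A", "B", np.nan, np.nan]})
# reverse, mark non-null Z, and accumulate: each block ending in a label
# (counted from the bottom) gets its own group number
key = df['Z'].iloc[::-1].notna().cumsum()
print(key.sort_index().tolist())  # [3, 3, 3, 2, 2, 1, 0, 0]
```

Rows 0-2 share label 3 (the block ending in the first "A"), rows 3-4 share 2, row 5 is 1, and the trailing rows with no terminating label get 0; dropna() then discards that leftover group (its 'first' Z is NaN) and sort_index(ascending=False) restores the original order.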
Here is one option:
(df.groupby(df.Z.shift().notnull().cumsum())
   .agg(list)
   .assign(Z=lambda x: x.Z.str[-1])
   [lambda x: x.Z.notnull()])
           X             Y  Z
Z
0  [1, 2, 3]  [10, 20, 30]  A
1     [4, 5]      [40, 50]  A
2        [6]          [60]  B
I have a dataframe of N columns. Each element in the dataframe is in the range [0, N-1].
For example, my dataframe can be something like this (N=3):
   A  B  C
0  0  2  0
1  1  0  1
2  2  2  0
3  2  0  0
4  0  0  0
I want to create a co-occurrence matrix (please correct me if there is a different standard name for it) of size N x N, in which each element (i, j) contains the number of times that columns i and j assume the same value.
   A  B  C
A  x  2  3
B  2  x  2
C  3  2  x
Where, for example, matrix[0, 1] means that A and B assume the same value 2 times.
I don't care about the value on the diagonal.
What is the smartest way to do that?
DataFrame.corr
We can pass a custom callable for calculating the correlation between the columns of the dataframe. This callable takes two 1D numpy arrays as its input arguments and returns the count of the number of times the elements of the two arrays equal each other:
df.corr(method=lambda x, y: (x==y).sum())
     A    B    C
A  1.0  2.0  3.0
B  2.0  1.0  2.0
C  3.0  2.0  1.0
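One caveat worth knowing: DataFrame.corr always forces the diagonal to 1.0 regardless of the callable, so the diagonal above is not the self-match count (which would be 5, the number of rows). If you want a true co-occurrence matrix, you can replace the diagonal afterwards; a sketch of one way to do that:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 2, 2, 0],
                   'B': [2, 0, 2, 0, 0],
                   'C': [0, 1, 0, 0, 0]})
cooc = df.corr(method=lambda x, y: (x == y).sum())
# corr pins the diagonal to 1.0; a column matches itself in every row,
# so overwrite the diagonal with the row count
cooc = cooc.where(~np.eye(len(cooc), dtype=bool), len(df))
print(cooc)
```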
Let's try broadcasting across the transposition and summing axis 2:
import pandas as pd
df = pd.DataFrame({
    'A': {0: 0, 1: 1, 2: 2, 3: 2, 4: 0},
    'B': {0: 2, 1: 0, 2: 2, 3: 0, 4: 0},
    'C': {0: 0, 1: 1, 2: 0, 3: 0, 4: 0}
})
vals = df.T.values
e = (vals[:, None] == vals).sum(axis=2)
e:
[[5 2 3]
 [2 5 2]
 [3 2 5]]
Turn back into a dataframe:
new_df = pd.DataFrame(e, columns=df.columns, index=df.columns)
new_df:
   A  B  C
A  5  2  3
B  2  5  2
C  3  2  5
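To see why this works, look at the shapes: vals is (3, 5) with one row per column of df, vals[:, None] is (3, 1, 5), and comparing the two broadcasts to (3, 3, 5), i.e. every pair of rows compared element-wise; summing over the last axis counts the matches per pair:

```python
import numpy as np

vals = np.array([[0, 1, 2, 2, 0],    # column A
                 [2, 0, 2, 0, 0],    # column B
                 [0, 1, 0, 0, 0]])   # column C
pairwise = vals[:, None] == vals     # (3, 1, 5) == (3, 5) -> (3, 3, 5)
print(pairwise.shape)                # (3, 3, 5)
print(pairwise.sum(axis=2))          # match counts for every pair of columns
```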
I don't know about the smartest way but I think this works:
import numpy as np
m = np.array([[0, 2, 0], [1, 0, 1], [2, 2, 0], [2, 0, 0], [0, 0, 0]])
n = 3
ans = np.zeros((n, n))
for i in range(n):
    for j in range(i+1, n):
        ans[i, j] = len(m) - np.count_nonzero(m[:, i] - m[:, j])
print(ans + ans.T)
The problem is very similar to that in How to evaluate the sum of values within array blocks, where I need to sum up elements in a matrix by blocks. What is different here is that the blocks could have different sizes. For example, given a 4-by-5 matrix
1 1 | 1 1 1
----|------
1 1 | 1 1 1
1 1 | 1 1 1
1 1 | 1 1 1
and block sizes (1, 3) along the rows and (2, 3) along the columns, the result should be a 2-by-2 matrix:
2 3
6 9
Is there a way of doing this without loops?
Seems like a good fit to use np.add.reduceat to basically sum along rows and then along cols -
def sum_blocks(a, row_sizes, col_sizes):
    # Sum rows based on row-sizes
    s1 = np.add.reduceat(a, np.r_[0, row_sizes[:-1].cumsum()], axis=0)
    # Sum cols from row-summed output based on col-sizes
    return np.add.reduceat(s1, np.r_[0, col_sizes[:-1].cumsum()], axis=1)
Sample run -
In [45]: np.random.seed(0)
...: a = np.random.randint(0,9,(4,5))
In [46]: a
Out[46]:
array([[5, 0, 3, 3, 7],
       [3, 5, 2, 4, 7],
       [6, 8, 8, 1, 6],
       [7, 7, 8, 1, 5]])
In [47]: row_sizes = np.array([1,3])
...: col_sizes = np.array([2,3])
In [48]: sum_blocks(a, row_sizes, col_sizes)
Out[48]:
array([[ 5, 13],
       [36, 42]])
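The np.r_[0, sizes[:-1].cumsum()] expressions turn block sizes into the start offsets that reduceat expects. Checking the helper on the all-ones example from the question, where each block should sum to its area:

```python
import numpy as np

def sum_blocks(a, row_sizes, col_sizes):
    # reduceat wants block *start offsets*, not sizes
    s1 = np.add.reduceat(a, np.r_[0, row_sizes[:-1].cumsum()], axis=0)
    return np.add.reduceat(s1, np.r_[0, col_sizes[:-1].cumsum()], axis=1)

row_sizes = np.array([1, 3])
col_sizes = np.array([2, 3])
print(np.r_[0, row_sizes[:-1].cumsum()])  # [0 1]: row blocks start at rows 0 and 1
print(sum_blocks(np.ones((4, 5), dtype=int), row_sizes, col_sizes))
# [[2 3]
#  [6 9]]
```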
I have a df that is "packed", and I'm trying to find a way to unpack it into multiple columns and rows.
Input, a df with multiple lists within each column:
  all_labels                              values     labels
0  [A, B, C]  [[10, 1, 3], [5, 6, 3], [0, 0, 0]]  [X, Y, Z]
desired output: unpacked df
    X  Y  Z
A  10  1  3
B   5  6  3
C   0  0  0
I tried this for the all_labels & labels columns, but I'm not sure how to do it for the values column:
df.labels.apply(pd.Series)
df.all_labels.apply(pd.Series)
Setup
packed = pd.DataFrame({
    'all_labels': [['A', 'B', 'C']],
    'values': [[[10, 1, 3], [5, 6, 3], [0, 0, 0]]],
    'labels': [['X', 'Y', 'Z']]
})
Keep It Simple
pd.DataFrame(packed['values'][0], packed['all_labels'][0], packed['labels'][0])
    X  Y  Z
A  10  1  3
B   5  6  3
C   0  0  0
rename and dict unpacking
The columns are so close to the argument names of the dataframe constructor, I couldn't resist...
rnm = {'all_labels': 'index', 'values': 'data', 'labels': 'columns'}
pd.DataFrame(**packed.rename(columns=rnm).loc[0])
    X  Y  Z
A  10  1  3
B   5  6  3
C   0  0  0
Without rename and list unpacking instead
Making sure to list the column names in the same order as the arguments of the pandas.DataFrame constructor:
pd.DataFrame(*packed.loc[0, ['values', 'all_labels', 'labels']])
    X  Y  Z
A  10  1  3
B   5  6  3
C   0  0  0
Bonus Material
The pandas.DataFrame.to_dict method will return a dictionary that looks similar to this.
df = pd.DataFrame(*packed.loc[0, ['values', 'all_labels', 'labels']])
df.to_dict('split')
{'index': ['A', 'B', 'C'],
'columns': ['X', 'Y', 'Z'],
'data': [[10, 1, 3], [5, 6, 3], [0, 0, 0]]}
We could wrap that in another DataFrame constructor call to get back something very similar to what we started with.
pd.DataFrame([df.to_dict('split')])
       index    columns                                 data
0  [A, B, C]  [X, Y, Z]  [[10, 1, 3], [5, 6, 3], [0, 0, 0]]
I have a data frame, say df1, with a MULTILEVEL INDEX:
     A  B   C   D
0 0  0  1   2   3
     4  5   6   7
1 2  8  9  10  11
  3  2  3   4   5
and I have another data frame, df2, with 2 columns (B and C) in common, also with a MULTILEVEL INDEX:
     X  B  C   Y
0 0  0  0  7   3
  1  4  5  6   7
1 2  8  2  3  11
  3  2  3  4   5
I need to remove the rows from df1 where the values of columns B and C are the same as in df2, so I should be getting something like this:
     A  B   C   D
0 0  0  1   2   3
1 2  8  9  10  11
I have tried getting the index of the common elements and then removing them via a list, but the indices are all messed up and in multi-level form.
You can do this in a one-liner using pandas.DataFrame.iloc, numpy.where and numpy.logical_or (I find it to be the simplest way). Note that this compares B and C row-by-row, so it assumes df1 and df2 have the same length and aligned rows:
df1 = df1.iloc[np.where(np.logical_or(df1['B'] != df2['B'], df1['C'] != df2['C']))]
of course don't forget to:
import numpy as np
output:
     A  B   C   D
0 0  0  1   2   3
1 2  8  9  10  11
Hope this was helpful. If there are any questions or remarks please feel free to comment.
You could make MultiIndexes out of the B and C columns, and then call the index's isin method:
idx1 = pd.MultiIndex.from_arrays([df1['B'],df1['C']])
idx2 = pd.MultiIndex.from_arrays([df2['B'],df2['C']])
mask = idx1.isin(idx2)
result = df1.loc[~mask]
For example,
import pandas as pd
df1 = pd.DataFrame({'A': [0, 4, 8, 2], 'B': [1, 5, 9, 3], 'C': [2, 6, 10, 4],
                    'D': [3, 7, 11, 5], 'P': [0, 0, 1, 1], 'Q': [0, 0, 2, 3]})
df1 = df1.set_index(list('PQ'))
df1.index.names = [None,None]
df2 = pd.DataFrame({'B': [0, 5, 2, 3], 'C': [7, 6, 3, 4], 'P': [0, 0, 1, 1],
                    'Q': [0, 1, 2, 3], 'X': [0, 4, 8, 2], 'Y': [3, 7, 11, 5]})
df2 = df2.set_index(list('PQ'))
df2.index.names = [None,None]
idx1 = pd.MultiIndex.from_arrays([df1['B'],df1['C']])
idx2 = pd.MultiIndex.from_arrays([df2['B'],df2['C']])
mask = idx1.isin(idx2)
result = df1.loc[~mask]
print(result)
yields
     A  B   C   D
0 0  0  1   2   3
1 2  8  9  10  11
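For completeness, the same "anti-join" can also be phrased with merge and its indicator flag, which avoids building the MultiIndexes by hand. A sketch under the assumption that the index levels are named (here P and Q) so they survive the reset_index round-trip:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [0, 4, 8, 2], 'B': [1, 5, 9, 3], 'C': [2, 6, 10, 4],
                    'D': [3, 7, 11, 5], 'P': [0, 0, 1, 1],
                    'Q': [0, 0, 2, 3]}).set_index(['P', 'Q'])
df2 = pd.DataFrame({'B': [0, 5, 2, 3], 'C': [7, 6, 3, 4]})
# a left merge with indicator marks each row of df1 whose (B, C) pair
# also occurs in df2; keep only the rows found on the left side alone
merged = df1.reset_index().merge(df2.drop_duplicates(), on=['B', 'C'],
                                 how='left', indicator=True)
result = (merged[merged['_merge'] == 'left_only']
          .drop(columns='_merge')
          .set_index(['P', 'Q']))
print(result)
```

drop_duplicates on df2 guards against a pair appearing there more than once, which would otherwise duplicate rows of df1 in the merge.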