Elegant way to perform column-wise operations on two dataframes - python

I need to perform an all-pairs column-wise operation on a dataframe. I came up with a naive solution, but I'm wondering whether a more elegant way is available.
The following script counts, for every pair of columns, the number of rows that have a one in both columns.
Input:
   a  b  c  d
0  0  0  1  0
1  1  1  0  1
2  1  1  1  0
Output:
2 2 1 1
2 2 1 1
1 1 2 0
1 1 0 1
Code:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, high=2, size=(3, 4)), columns=['a', 'b', 'c', 'd'])
mycolumns = df.columns
for i in range(df.shape[1]):
    for j in range(df.shape[1]):
        print(sum(df[mycolumns[i]] & df[mycolumns[j]]))

That is basically the matrix multiplication of X' and X, where X' is the transpose of X:
>>> xs = df.values
>>> xs.T.dot(xs)
array([[2, 2, 1, 1],
       [2, 2, 1, 1],
       [1, 1, 2, 0],
       [1, 1, 0, 1]])
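If you want to keep the column labels, the same product can be computed directly on the DataFrame - a small sketch of the same idea, rebuilding the example input shown above:

import pandas as pd

df = pd.DataFrame([[0, 0, 1, 0],
                   [1, 1, 0, 1],
                   [1, 1, 1, 0]], columns=['a', 'b', 'c', 'd'])

# DataFrame.dot aligns on the index, so the result keeps the column labels
print(df.T.dot(df))
#    a  b  c  d
# a  2  2  1  1
# b  2  2  1  1
# c  1  1  2  0
# d  1  1  0  1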

Related

Inplace operation on specific lines and columns of dataframe

Say I have a dataframe with negative values on specific columns:
df = pd.DataFrame([[1, 1, -1],[-1, 1, 1],[-1, -1, 1]])
Now, I want to clip the negative values to 0 in place, but only on specific rows and columns:
df.loc[[1, 2], [0, 1]].clip(lower=0, inplace=True)
But this doesn't work:
df
Out:
0 1 2
0 1 1 -1
1 -1 1 1
2 -1 -1 1
This is because slicing a dataframe with a list of integers returns a copy:
df.loc[[1, 2], [0, 1]] is df.loc[[1, 2], [0, 1]]
Out: False
How do I make inplace changes to specific rows and columns then?
How about using df.lt instead:
df[df.loc[[1, 2], [0, 1]].lt(0)] = 0
print(df)
0 1 2
0 1 1 -1
1 0 1 1
2 0 0 1
You can do this:
df.loc[[1, 2], [0, 1]] = df.loc[[1, 2], [0, 1]].clip(lower=0)
Output:
0 1 2
0 1 1 -1
1 0 1 1
2 0 0 1
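As another option (not taken from the answers above, just a sketch), DataFrame.update also writes values back into the original frame, aligning the clipped sub-frame on its row index and column labels; note that update may upcast the touched columns to float in some pandas versions:

import pandas as pd

df = pd.DataFrame([[1, 1, -1], [-1, 1, 1], [-1, -1, 1]])

# clip a copy of the sub-frame, then let update() align it back into df in place
df.update(df.loc[[1, 2], [0, 1]].clip(lower=0))

print(df)
# row 1 becomes [0, 1, 1] and row 2 becomes [0, 0, 1], as in the answers above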

Find the rows that share the value

I need to find the rows where columns A, B and C all have the value 1, and then create a new column that holds the result.
My idea is to use np.where() with some condition, but I don't know the correct way of dealing with this problem. From what I have read, I'm not supposed to iterate through a dataframe but rather use one of pandas' vectorized methods?
import pandas as pd

df1 = pd.DataFrame({'A': [0, 1, 1, 0],
                    'B': [1, 1, 0, 1],
                    'C': [0, 1, 1, 1]},
                   index=[0, 1, 2, 4])
print(df1)
what I am after is this:
A B C TRUE
0 0 1 0 0
1 1 1 1 1 <----
2 1 0 1 0
4 0 1 1 0
If the data is always 0/1, you can simply take the product per row:
df1['TRUE'] = df1.prod(1)
output:
A B C TRUE
0 0 1 0 0
1 1 1 1 1
2 1 0 1 0
4 0 1 1 0
This is what you are looking for:
df1["TRUE"] = (df1==1).all(axis=1).astype(int)

Apply function rowwise to pandas dataframe while referencing a column

I have a pandas dataframe like this:
df = pd.DataFrame({'A': [2, 3], 'B': [1, 2], 'C': [0, 1], 'D': [1, 0], 'total': [4, 6]})
A B C D total
0 2 1 0 1 4
1 3 2 1 0 6
I'm trying to perform a row-wise calculation and create a new column with the result. The calculation is to divide each of the columns A, B, C, D by total, square the result, and sum the squares row-wise. This should be the result (0 if total is 0):
A B C D total result
0 2 1 0 1 4 0.375
1 3 2 1 0 6 0.389
This is what I've tried so far, but it always returns 0:
df['result'] = df[['A', 'B', 'C', 'D']].apply(lambda x: ((x/df['total'])**2).sum(), axis=1)
I guess the problem is df['total'] in the lambda function, because if I replace this by a number it works fine. I don't know how to work around this though. Appreciate any suggestions.
A combination of div, pow and sum can solve this:
df["result"] = df.filter(regex="[^total]").div(df.total, axis=0).pow(2).sum(1)
df
A B C D total result
0 2 1 0 1 4 0.375000
1 3 2 1 0 6 0.388889
You could do:
df['result'] = (df.loc[:, "A": 'D'].divide(df.total, axis=0) ** 2).sum(axis=1)

Why can't I get the correct mask column in pandas

For example, if I have a data frame
x f
0 0 [0, 1]
1 1 [3]
2 2 [2, 3, 4]
3 3 [3, 6]
4 4 [4, 5]
I want to remove the rows where the value in column x does not appear in the list in column f. I tried with where and apply but I can't get the expected results. I got the table below, and I want to know why rows 0, 2 and 3 are 0 instead of 1?
x f mask
0 0 [0, 1] 0
1 1 [3] 0
2 2 [2, 3, 4] 0
3 3 [3, 6] 0
4 4 [4, 5] 0
Does anyone know why? And what should I do to handle this number-versus-list case?
df1 = pd.DataFrame({'x': [0,1,2,3,4],'f' :[[0,1],[3],[2,3,4],[3,6],[3,5]]}, index = [0,1,2,3,4])
df1['mask'] = np.where(df1.x.values in df1.f.values ,1,0)
Here it is necessary to test the values in pairs - a solution with in inside a list comprehension:
df1['mask'] = np.where([a in b for a, b in df1[['x', 'f']].values],1,0)
Or with DataFrame.apply and axis=1:
df1['mask'] = np.where(df1.apply(lambda x: x.x in x.f, axis=1),1,0)
print (df1)
x f mask
0 0 [0, 1] 1
1 1 [3] 0
2 2 [2, 3, 4] 1
3 3 [3, 6] 1
4 4 [3, 5] 0
IIUC, expand each list into a row of a new DataFrame, then use isin:
pd.DataFrame(df1.f.tolist()).isin(df1.x).any(1).astype(int)
Out[10]:
0 1
1 0
2 1
3 1
4 0
dtype: int32
df1['mask'] = pd.DataFrame(df1.f.tolist()).isin(df1.x).any(1).astype(int)
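On pandas 0.25 or newer, another option (just a sketch, not taken from the answers above) is DataFrame.explode, which gives every list element its own row so the check becomes a plain column-to-column comparison:

import pandas as pd

df1 = pd.DataFrame({'x': [0, 1, 2, 3, 4],
                    'f': [[0, 1], [3], [2, 3, 4], [3, 6], [3, 5]]})

# explode 'f' so each list element gets its own row (index values repeat),
# compare against 'x', then collapse back per original row with any()
exploded = df1.explode('f')
df1['mask'] = (exploded['x'] == exploded['f']).groupby(level=0).any().astype(int)
print(df1)
#    x          f  mask
# 0  0     [0, 1]     1
# 1  1        [3]     0
# 2  2  [2, 3, 4]     1
# 3  3     [3, 6]     1
# 4  4     [3, 5]     0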

Counting and printing zeroes and negative values for each Column in a Dataframe

I'd like to print a statement showing me both zero and negative value counts in each Dataframe column.
My input would be something like:
import pandas as pd
df = pd.DataFrame({'a': [-3, -2, 0], 'b': [-2, 2, 5], 'c': [-1, 0, 7], 'd': [1, 4, 8]})
Which prints:
a b c d
0 -3 -2 -1 1
1 -2 2 0 4
2 0 5 7 8
The outputs I desire are:
Negatives Found:
a 2
b 1
c 1
d 0
Zeros Found:
a 1
b 0
c 1
d 0
I can't find an easy way to get this without creating another DataFrame from the DataFrame, using something like:
df_neg = df < 0
df_zero = df == 0
However, this only gives me True or False values rather than counts.
What's the best way of doing a count that is printable and 'easy' to run on bigger data sets?
This is somewhat what you tried:
Negatives:
(df<0).sum()
Zeros:
(df==0).sum()
If this isn't good for you, and you really don't want to generate a mask of booleans and count them (though I'm not sure why that would bother you), let me know; you can get the same results with loops.
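For example, wrapped in the labelled prints the question asks for (a minimal usage sketch):

import pandas as pd

df = pd.DataFrame({'a': [-3, -2, 0], 'b': [-2, 2, 5], 'c': [-1, 0, 7], 'd': [1, 4, 8]})

print('Negatives Found:')
print((df < 0).sum())
print('Zeros Found:')
print((df == 0).sum())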
You could use where and count
df.where(condition).count()
df = pd.DataFrame({'a': [-3, -2, 0], 'b': [-2, 2, 5], 'c': [-1, 0, 7], 'd': [1, 4, 8]})
print('Negatives Found:')
print(df.where(df < 0).count())
print('Zeros Found:')
print(df.where(df == 0).count())
This prints
Negatives Found:
a 2
b 1
c 1
d 0
Zeros Found:
a 1
b 0
c 1
d 0
You can simply:
print(df[df<0].count())
print(df[df==0].count())
a 2
b 1
c 1
d 0
dtype: int64
a 1
b 0
c 1
d 0
dtype: int64
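If you want both counts side by side (just a sketch, not part of the answers above), you can collect them into a single summary frame:

import pandas as pd

df = pd.DataFrame({'a': [-3, -2, 0], 'b': [-2, 2, 5], 'c': [-1, 0, 7], 'd': [1, 4, 8]})

# each boolean mask sums column-wise; True counts as 1
summary = pd.DataFrame({'Negatives Found': (df < 0).sum(),
                        'Zeros Found': (df == 0).sum()})
print(summary)
#    Negatives Found  Zeros Found
# a                2            1
# b                1            0
# c                1            1
# d                0            0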
