Finding elements in a pandas dataframe - python

I have a pandas dataframe which looks like the following:
0 1
0 2
2 3
1 4
What I want to do is the following: given an input such as 2, my code should search the dataframe for 2 and, wherever it finds it, return the value in the other column of that row. In the above example my code would return 0 and 3. I know that I can simply look at each row and check whether either element is equal to 2, but I was wondering if there is a one-liner for this problem.
UPDATE: None of the columns are index columns.
Thanks

>>> df = pd.DataFrame({'A': [0, 0, 2, 1], 'B': [1,2,3,4]})
>>> df
A B
0 0 1
1 0 2
2 2 3
3 1 4
The following pandas syntax is equivalent to the SQL SELECT B FROM df WHERE A = 2
>>> df[df['A'] == 2]['B']
2 3
Name: B, dtype: int64
There's also pandas.DataFrame.query:
>>> df.query('A == 2')['B']
2 3
Name: B, dtype: int64

You may need this:
n_input = 2
df[(df == n_input).any(axis=1)].stack()[lambda x: x != n_input].unique()
# array([0, 3])
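For readers who prefer the steps spelled out, here is a minimal sketch of the same lookup using two boolean masks. The helper name `partners` is hypothetical (it is not part of pandas), and the sketch assumes the two columns are named A and B as in the example above.

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 0, 2, 1], 'B': [1, 2, 3, 4]})

def partners(frame, value):
    """Return the values that share a row with `value`, in row order.

    Hypothetical helper: assumes exactly two columns named 'A' and 'B'.
    """
    in_a = frame['A'] == value   # rows where the value sits in column A
    in_b = frame['B'] == value   # rows where the value sits in column B
    # Take B where the match is in A, and A where the match is in B.
    found = pd.concat([frame.loc[in_a, 'B'], frame.loc[in_b, 'A']])
    return found.sort_index().tolist()

result = partners(df, 2)
print(result)
```

Unlike the `unique()` one-liner, this version keeps duplicates, which may or may not be what you want.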

df = pd.DataFrame({'A': [0, 0, 2, 1], 'B': [1,2,3,4]})
t = [df.loc[lambda df: df['A'] == 3]]
t

Related

Efficient way in Pandas to count occurrences of Series of values by row

I have a large dataframe for which I want to count, by row, the number of occurrences of a series of specific values (given by an external function). For reproducibility let's assume the following simplified dataframe:
data = {'A': [3, 2, 1, 0], 'B': [4, 3, 2, 1], 'C': [1, 2, 3, 4], 'D': [1, 1, 2, 2], 'E': [4, 4, 4, 4]}
df = pd.DataFrame.from_dict(data)
df
A B C D E
0 3 4 1 1 4
1 2 3 2 1 4
2 1 2 3 2 4
3 0 1 4 2 4
How can I count the number of occurrences of specific values (given by a series with the same length as the dataframe) by row?
Again for simplicity, let's assume this values_series is given by the max of each row.
values_series = df.max(axis=1)
0 4
1 4
2 4
3 4
dtype: int64
The solution I got to seems not very pythonic (e.g. I'm using iterrows(), which is slow):
max_count = []
for index, row in df.iterrows():
    max_count.append(row.value_counts()[values_series.loc[index]])
df_counts = pd.Series(max_count)
Is there any more efficient way to do this?
We can compare the transposed df.T directly to the df.max series, thanks to broadcasting:
(df.T == df.max(axis=1)).sum()
# result
0 2
1 1
2 1
3 2
dtype: int64
(Transposing also has the added benefit that we can use sum without specifying the axis, i.e. with the default axis=0.)
You can try
df.eq(df.max(1),axis=0).sum(1)
Out[361]:
0 2
1 1
2 1
3 2
dtype: int64
The perfect job for numpy broadcasting:
a = df.to_numpy()
b = values_series.to_numpy()[:, None]
(a == b).sum(axis=1)
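As a quick sanity check, the sketch below runs the three suggestions side by side on the sample data from the question. The variable names `counts_numpy`, `counts_eq` and `counts_transposed` are only illustrative; the shapes in the comments show why the NumPy comparison broadcasts row by row.

```python
import pandas as pd

data = {'A': [3, 2, 1, 0], 'B': [4, 3, 2, 1], 'C': [1, 2, 3, 4],
        'D': [1, 1, 2, 2], 'E': [4, 4, 4, 4]}
df = pd.DataFrame.from_dict(data)
values_series = df.max(axis=1)

# NumPy broadcasting: (4, 5) compared with (4, 1) compares each row
# against its own target value.
a = df.to_numpy()                       # shape (4, 5)
b = values_series.to_numpy()[:, None]   # shape (4, 1)
counts_numpy = pd.Series((a == b).sum(axis=1), index=df.index)

# pandas equivalents: align the series with the row index ...
counts_eq = df.eq(values_series, axis=0).sum(axis=1)
# ... or transpose so the series aligns with the columns.
counts_transposed = (df.T == values_series).sum()

print(counts_numpy.tolist())
```

The NumPy version skips index alignment, so it only gives the right answer when `values_series` is in the same row order as `df`.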

Add another column based on the value of two columns

I am trying to add another column based on the value of two columns. Here is the mini version of my dataframe.
data = {'current_pair': ['"["StimusNeu/2357.jpg","StimusNeu/5731.jpg"]"', '"["StimusEmo/6350.jpg","StimusEmo/3230.jpg"]"', '"["StimusEmo/3215.jpg","StimusEmo/9570.jpg"]"','"["StimusNeu/7020.jpg","StimusNeu/7547.jpg"]"', '"["StimusNeu/7080.jpg","StimusNeu/7179.jpg"]"'],
'B': [1, 0, 1, 1, 0]
}
df = pd.DataFrame(data)
df
current_pair B
0 "["StimusNeu/2357.jpg","StimusNeu/5731.jpg"]" 1
1 "["StimusEmo/6350.jpg","StimusEmo/3230.jpg"]" 0
2 "["StimusEmo/3215.jpg","StimusEmo/9570.jpg"]" 1
3 "["StimusNeu/7020.jpg","StimusNeu/7547.jpg"]" 1
4 "["StimusNeu/7080.jpg","StimusNeu/7179.jpg"]" 0
I want the result to be:
current_pair B C
0 "["StimusNeu/2357.jpg","StimusNeu/5731.jpg"]" 1 1
1 "["StimusEmo/6350.jpg","StimusEmo/3230.jpg"]" 0 2
2 "["StimusEmo/3215.jpg","StimusEmo/9570.jpg"]" 1 0
3 "["StimusNeu/7020.jpg","StimusNeu/7547.jpg"]" 1 1
4 "["StimusNeu/7080.jpg","StimusNeu/7179.jpg"]" 0 2
I used the numpy select commands:
conditions=[(data['B']==1 & data['current_pair'].str.contains('Emo/', na=False)),
(data['B']==1 & data['current_pair'].str.contains('Neu/', na=False)),
data['B']==0]
choices = [0, 1, 2]
data['C'] = np.select(conditions, choices, default=np.nan)
Unfortunately, it gives me this dataframe without recognizing anything with "1" in column "C".
current_pair B C
0 "["StimusNeu/2357.jpg","StimusNeu/5731.jpg"]" 1 0
1 "["StimusEmo/6350.jpg","StimusEmo/3230.jpg"]" 0 2
2 "["StimusEmo/3215.jpg","StimusEmo/9570.jpg"]" 1 0
3 "["StimusNeu/7020.jpg","StimusNeu/7547.jpg"]" 1 0
4 "["StimusNeu/7080.jpg","StimusNeu/7179.jpg"]" 0 2
Any help counts! thanks a lot.
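The problem is operator precedence: & binds more tightly than ==, so data['B']==1 & mask is evaluated as data['B'] == (1 & mask). Put each comparison in its own parentheses: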
conditions=[(data['B']==1) & data['current_pair'].str.contains('Emo/', na=False),
(data['B']==1) & data['current_pair'].str.contains('Neu/', na=False),
data['B']==0]
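The precedence point can be checked without pandas at all. The sketch below uses the standard-library `ast` module to show how Python groups the expression, then applies the parenthesised conditions to a DataFrame. It assumes the data lives in a DataFrame called `df` (not the raw dict), and the file-name strings are simplified from the question.

```python
import ast

import numpy as np
import pandas as pd

def parsed(expr):
    # Parentheses do not appear in the AST, so equal dumps mean equal grouping.
    return ast.dump(ast.parse(expr, mode="eval"))

same_as_wrong = parsed("b == 1 & m") == parsed("b == (1 & m)")
same_as_right = parsed("b == 1 & m") == parsed("(b == 1) & m")

df = pd.DataFrame({
    'current_pair': ['StimusNeu/2357.jpg', 'StimusEmo/6350.jpg',
                     'StimusEmo/3215.jpg', 'StimusNeu/7020.jpg',
                     'StimusNeu/7080.jpg'],
    'B': [1, 0, 1, 1, 0],
})

conditions = [
    (df['B'] == 1) & df['current_pair'].str.contains('Emo/', na=False),
    (df['B'] == 1) & df['current_pair'].str.contains('Neu/', na=False),
    df['B'] == 0,
]
choices = [0, 1, 2]
df['C'] = np.select(conditions, choices, default=np.nan)
print(df)
```

Because the default is `np.nan`, column C comes back as floats; pass an integer default if you need an integer column.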
I think some logic went wrong here; this works:
df.assign(C=np.select([df.B==0, df.current_pair.str.contains('Emo/'), df.current_pair.str.contains('Neu/')], [2,0,1]))
Here is a slightly more general suggestion that is easy to adapt to more complex cases. Keep execution speed in mind, however, since apply with axis=1 calls the function once per row:
import pandas as pd
df = pd.DataFrame({'col_1': ['Abc', 'Xcd', 'Afs', 'Xtf', 'Aky'], 'col_2': [1, 2, 3, 4, 5]})
def someLogic(col_1, col_2):
    if 'A' in col_1 and col_2 == 1:
        return 111
    elif "X" in col_1 and col_2 == 4:
        return 999
    return 888
df['NewCol'] = df.apply(lambda row: someLogic(row.col_1, row.col_2), axis=1, result_type="expand")
print(df)
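If speed matters, the same rules can usually be written as vectorized conditions. The following is a sketch of one way to do that with `np.select`, not a drop-in replacement for arbitrary row logic; the column name `NewColVectorized` is made up for the comparison.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col_1': ['Abc', 'Xcd', 'Afs', 'Xtf', 'Aky'],
                   'col_2': [1, 2, 3, 4, 5]})

def someLogic(col_1, col_2):
    if 'A' in col_1 and col_2 == 1:
        return 111
    elif "X" in col_1 and col_2 == 4:
        return 999
    return 888

# Row-by-row version from the answer above.
df['NewCol'] = df.apply(lambda row: someLogic(row.col_1, row.col_2), axis=1)

# Vectorized version: conditions are checked in order, first match wins,
# which mirrors the if/elif chain; `default` plays the role of the final return.
conditions = [
    df['col_1'].str.contains('A', regex=False) & (df['col_2'] == 1),
    df['col_1'].str.contains('X', regex=False) & (df['col_2'] == 4),
]
df['NewColVectorized'] = np.select(conditions, [111, 999], default=888)
print(df)
```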

Find counts between multiple columns in python [duplicate]

I have a dataframe like this
df = pd.DataFrame({'a' : [1,1,0,0], 'b': [0,1,1,0], 'c': [0,0,1,1]})
I want to get
a b c
a 2 1 0
b 1 2 1
c 0 1 2
where a,b,c are column names, and I get the values counting '1' in all columns when the filter is '1' in another column.
For example, when df.a == 1, we count a = 2, b = 1, c = 0, etc.
I made a loop to solve
matrix = []
for name, values in df.iteritems():
    matrix.append(pd.DataFrame( df.groupby(name, as_index=False).apply(lambda x: x[x == 1].count())).values.tolist()[1])
pd.DataFrame(matrix)
But I think there must be a simpler solution, isn't there?
You appear to want the matrix product, so leverage DataFrame.dot:
df.T.dot(df)
a b c
a 2 1 0
b 1 2 1
c 0 1 2
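To see why the matrix product gives the requested counts: entry (i, j) of df.T.dot(df) is the sum over rows of df[i] * df[j], and for 0/1 data that product is 1 only when both columns are 1 in the same row. The sketch below compares the product with an explicit pairwise count; the name `explicit` is just for illustration, and the equivalence assumes the columns contain only 0 and 1.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 0, 0], 'b': [0, 1, 1, 0], 'c': [0, 0, 1, 1]})

# Matrix product from the answer above.
product = df.T.dot(df)

# Explicit pairwise count: rows where both columns equal 1.
explicit = pd.DataFrame(
    [[int(((df[i] == 1) & (df[j] == 1)).sum()) for j in df.columns]
     for i in df.columns],
    index=df.columns, columns=df.columns,
)
print(explicit)
```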
Alternatively, if you want the same level of performance without the overhead of pandas, you could compute the product with np.dot:
v = df.values
pd.DataFrame(v.T.dot(v), index=df.columns, columns=df.columns)
Or, if you want to get cute,
(lambda a, c: pd.DataFrame(a.T.dot(a), c, c))(df.values, df.columns)
a b c
a 2 1 0
b 1 2 1
c 0 1 2
—piRSquared
np.einsum
Not as pretty as df.T.dot(df) but how often do you see np.einsum amirite?
pd.DataFrame(np.einsum('ij,ik->jk', df, df), df.columns, df.columns)
a b c
a 2 1 0
b 1 2 1
c 0 1 2
You can also do the multiplication with the @ (matrix multiplication) operator on NumPy arrays.
df = pd.DataFrame(df.values.T @ df.values, df.columns, df.columns)
Numpy matmul
np.matmul(df.values.T,df.values)
Out[87]:
array([[2, 1, 0],
[1, 2, 1],
[0, 1, 2]], dtype=int64)
#pd.DataFrame(np.matmul(df.values.T,df.values), df.columns, df.columns)

Apply function rowwise to pandas dataframe while referencing a column

I have a pandas dataframe like this:
df = pd.DataFrame({'A': [2, 3], 'B': [1, 2], 'C': [0, 1], 'D': [1, 0], 'total': [4, 6]})
A B C D total
0 2 1 0 1 4
1 3 2 1 0 6
I'm trying to perform a rowwise calculation and create a new column with the result. The calculation is to divide each column ABCD by the total, square it, and sum it up rowwise. This should be the result (0 if total is 0):
A B C D total result
0 2 1 0 1 4 0.375
1 3 2 1 0 6 0.389
This is what I've tried so far, but it always returns 0:
df['result'] = df[['A', 'B', 'C', 'D']].apply(lambda x: ((x/df['total'])**2).sum(), axis=1)
I guess the problem is df['total'] in the lambda function, because if I replace this by a number it works fine. I don't know how to work around this though. Appreciate any suggestions.
A combination of div, pow and sum can solve this:
df["result"] = df.filter(regex="[^total]").div(df.total, axis=0).pow(2).sum(1)
df
A B C D total result
0 2 1 0 1 4 0.375000
1 3 2 1 0 6 0.388889
You could also do:
df['result'] = (df.loc[:, "A": 'D'].divide(df.total, axis=0) ** 2).sum(axis=1)
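The attempt in the question returns 0 because, inside apply(..., axis=1), x is a single row indexed by A to D, while df['total'] is indexed by the row labels 0 and 1. Division aligns on labels, nothing matches, every entry becomes NaN, and sum() skips NaN. The sketch below reproduces that and adds one possible way to cover the "0 if total is 0" requirement, which the answers above do not address. The helper name `squared_share_sum` is hypothetical.

```python
import warnings

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [2, 3], 'B': [1, 2], 'C': [0, 1], 'D': [1, 0],
                   'total': [4, 6]})
cols = ['A', 'B', 'C', 'D']

# What the lambda in the question sees for the first row.
row = df.loc[0, cols]
with warnings.catch_warnings():
    warnings.simplefilter("ignore")   # mixed str/int labels may warn on sorting
    misaligned = row / df['total']    # labels A..D never match labels 0, 1
all_nan = bool(misaligned.isna().all())

def squared_share_sum(frame):
    """Hypothetical helper: sum of squared shares per row, 0 when total is 0."""
    safe_total = frame['total'].replace(0, np.nan)   # avoid dividing by zero
    shares = frame[cols].div(safe_total, axis=0)     # align on the row index
    return shares.pow(2).sum(axis=1)                 # NaN rows sum to 0.0

df['result'] = squared_share_sum(df)
print(df)
```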

Pandas tuples groupby aggregation

Certain columns of my data frame contain tuples. Whenever I aggregate via groupby, those columns do not appear in the resulting data frame unless I select them explicitly.
Example,
df = pd.DataFrame()
df['A'] = [1, 2, 1, 2]
df['B'] = [1, 2, 3, 4]
df['C'] = map(lambda s: (s,), df['B'])
print df
A B C
0 1 1 (1,)
1 2 2 (2,)
2 1 3 (3,)
3 2 4 (4,)
If I do the following way then the column C does not appear in aggregation
print df.groupby('A').sum()
B
A
1 4
2 6
But if I specify it explicitly it appears as expected
print df[['A', 'C']].groupby('A').sum()
C
A
1 (1, 3)
2 (2, 4)
Could you please tell me why the C column didn't appear in the first case?
I would like it to go by default.
Because you aggregate by column B, not column C:
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['A'] = [1, 2, 1, 2]
df['B'] = [1, 2, 3, 4]
df['C'] = map(lambda s: (s,), df['B'])
print df
df.at[0,'B'] = 10
print df
A B C
0 1 10 (1,)
1 2 2 (2,)
2 1 3 (3,)
3 2 4 (4,)
print df.groupby('A').sum()
B
A
1 13
2 6
print df.groupby('A')['B'].sum()
1 13
2 6
Name: B, dtype: int64
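Whether a column of tuples survives a bare .sum() has varied between pandas releases, so one way to avoid depending on that default is to name an aggregation for every column. The sketch below is written for Python 3 and is only an illustration of that idea; the lambda that concatenates tuples is an assumption about what "sum" should mean for column C.

```python
import pandas as pd

df = pd.DataFrame()
df['A'] = [1, 2, 1, 2]
df['B'] = [1, 2, 3, 4]
df['C'] = [(s,) for s in df['B']]   # list comprehension instead of Python 2 map

# Spell out one aggregation per column so nothing is dropped implicitly.
out = df.groupby('A').agg({
    'B': 'sum',
    'C': lambda tuples: sum(tuples, ()),   # concatenate the tuples in each group
})
print(out)
```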
