subsetting by two conditions (True & False) evaluating to (True) - python

import pandas as pd
d = {'col1':[1, 2, 3, 4, 5], 'col2':[5, 4, 3, 2, 1]}
df = pd.DataFrame(data=d)
df[(df['col1'] == 1) | (df['col1'] == df['col1'].max()) & (df['col1'] > 2)]
Why doesn't this filter out the first row, where col1 is less than 2?
I'm getting this:
col1 col2
0 1 5
4 5 1
Expecting this:
col1 col2
4 5 1

Per the first comment (thanks chepner!), this solved it:
df[((df['col1'] == 1) | (df['col1'] == df['col1'].max())) & (df['col1'] > 2)]
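The reason: in Python, & binds more tightly than |, so the original expression is parsed as A | (B & C), which lets every row with col1 == 1 through unconditionally. A minimal sketch showing both parses side by side:
import pandas as pd

d = {'col1': [1, 2, 3, 4, 5], 'col2': [5, 4, 3, 2, 1]}
df = pd.DataFrame(data=d)

# How the original expression is parsed: & groups before |
cond_original = (df['col1'] == 1) | ((df['col1'] == df['col1'].max()) & (df['col1'] > 2))
# The fix groups the | explicitly before applying &
cond_fixed = ((df['col1'] == 1) | (df['col1'] == df['col1'].max())) & (df['col1'] > 2)

print(df[cond_original])  # rows 0 and 4
print(df[cond_fixed])     # row 4 only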


Add another column based on the value of two columns

I am trying to add another column based on the values of two columns. Here is a mini version of my dataframe.
data = {'current_pair': ['"["StimusNeu/2357.jpg","StimusNeu/5731.jpg"]"', '"["StimusEmo/6350.jpg","StimusEmo/3230.jpg"]"', '"["StimusEmo/3215.jpg","StimusEmo/9570.jpg"]"','"["StimusNeu/7020.jpg","StimusNeu/7547.jpg"]"', '"["StimusNeu/7080.jpg","StimusNeu/7179.jpg"]"'],
'B': [1, 0, 1, 1, 0]
}
df = pd.DataFrame(data)
df
current_pair B
0 "["StimusNeu/2357.jpg","StimusNeu/5731.jpg"]" 1
1 "["StimusEmo/6350.jpg","StimusEmo/3230.jpg"]" 0
2 "["StimusEmo/3215.jpg","StimusEmo/9570.jpg"]" 1
3 "["StimusNeu/7020.jpg","StimusNeu/7547.jpg"]" 1
4 "["StimusNeu/7080.jpg","StimusNeu/7179.jpg"]" 0
I want the result to be:
current_pair B C
0 "["StimusNeu/2357.jpg","StimusNeu/5731.jpg"]" 1 1
1 "["StimusEmo/6350.jpg","StimusEmo/3230.jpg"]" 0 2
2 "["StimusEmo/3215.jpg","StimusEmo/9570.jpg"]" 1 0
3 "["StimusNeu/7020.jpg","StimusNeu/7547.jpg"]" 1 1
4 "["StimusNeu/7080.jpg","StimusNeu/7179.jpg"]" 0 2
I used the numpy select command:
conditions=[(data['B']==1 & data['current_pair'].str.contains('Emo/', na=False)),
(data['B']==1 & data['current_pair'].str.contains('Neu/', na=False)),
data['B']==0]
choices = [0, 1, 2]
data['C'] = np.select(conditions, choices, default=np.nan)
Unfortunately, it gives me this dataframe without recognizing anything with "1" in column "C".
current_pair B C
0 "["StimusNeu/2357.jpg","StimusNeu/5731.jpg"]" 1 0
1 "["StimusEmo/6350.jpg","StimusEmo/3230.jpg"]" 0 2
2 "["StimusEmo/3215.jpg","StimusEmo/9570.jpg"]" 1 0
3 "["StimusNeu/7020.jpg","StimusNeu/7547.jpg"]" 1 0
4 "["StimusNeu/7080.jpg","StimusNeu/7179.jpg"]" 0 2
Any help counts! Thanks a lot.
The problem is operator precedence: == binds more loosely than &, so data['B']==1 & ... is evaluated as data['B'] == (1 & ...). Put parentheses around each comparison (and build the conditions from the DataFrame df rather than the raw dict data):
conditions = [(df['B']==1) & df['current_pair'].str.contains('Emo/', na=False),
              (df['B']==1) & df['current_pair'].str.contains('Neu/', na=False),
              df['B']==0]
I think some logic went wrong here; this works:
df.assign(C=np.select([df.B == 0,
                       df.current_pair.str.contains('Emo/'),
                       df.current_pair.str.contains('Neu/')],
                      [2, 0, 1]))
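Note that np.select returns the choice for the first condition that is True, so when conditions overlap their order matters. A tiny illustration, independent of the data above:
import numpy as np

x = np.array([1, 5, 10])
# Both conditions hold for 10; the first match (x > 3) wins.
print(np.select([x > 3, x > 8], ['gt3', 'gt8'], default='small'))
# ['small' 'gt3' 'gt3']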
Here is a slightly more generalized suggestion, easily applicable to more complex cases. You should, however, mind execution speed, since apply evaluates the function once per row:
import pandas as pd

df = pd.DataFrame({'col_1': ['Abc', 'Xcd', 'Afs', 'Xtf', 'Aky'],
                   'col_2': [1, 2, 3, 4, 5]})

def someLogic(col_1, col_2):
    if 'A' in col_1 and col_2 == 1:
        return 111
    elif 'X' in col_1 and col_2 == 4:
        return 999
    return 888

df['NewCol'] = df.apply(lambda row: someLogic(row.col_1, row.col_2),
                        axis=1, result_type="expand")
print(df)

Pandas: how to identify rows with a specific value in column x, and use other values in the same row as variables?

I would like to identify a row containing 1 in column x, and use other values in that row as variables. Then I would like to find the row in a different dataframe that contains those values, and delete it.
df1:
z y x
--------
3 3 0
5 4 1
df2:
a b c d e
--------------
5 4 p p p <-- Delete this row
3 3 p p p
This is one way of doing it if you are looking for a scalable version, with more than just 2 rows in df1 and df2:
import pandas as pd

df1 = pd.DataFrame({"z": [3, 5], "y": [3, 4], "x": [0, 1]})
df2 = pd.DataFrame({"a": [5, 3], "b": [4, 3], "c": ["p", "p"],
                    "d": ["p", "p"], "e": ["p", "p"]})

df1_set = df1[df1.x == 1]
idx = []
for i in range(len(df2)):
    for j in range(len(df1_set)):
        # Collect df2 rows whose (a, b) matches (z, y) in the filtered df1.
        if (df2.a.iloc[i], df2.b.iloc[i]) == (df1_set.z.iloc[j], df1_set.y.iloc[j]):
            idx.append(i)
df2_set = df2.drop(idx)
The dataframes look like this:
df1_set
Out[50]:
z y x
1 5 4 1
df2_set
Out[51]:
a b c d e
1 3 3 p p p
Here df1_set is the subset of df1 where x = 1, and df2_set is the final output that you are seeking.
Explanation:
Find the rows with x = 1 in df1.
Run loops over df1_set and df2 to remove each row in df2 where (a, b) matches a (z, y) pair from df1_set.
For both DataFrames you can create a helper column used to match them together; it holds a tuple of the relevant values.
Then remove from the main dataframe the rows where (a, b) is contained in the values that come from the second dataframe:
df_idx = pd.DataFrame({'z': [3, 5], 'y': [3, 4], 'x': [0, 1]})
df = pd.DataFrame({'a': [5, 3], 'b': [4, 3], 'c': ['A', 'A'], 'd': ['A', 'A'], 'e': ['A', 'A']})
df_idx['match'] = df_idx[['z', 'y']].apply(tuple, axis=1)
df['match'] = df[['a', 'b']].apply(tuple, axis=1)
selectors = df_idx[df_idx.x == 1]
result = df[~df.match.isin(selectors['match'].tolist())]
>>> print(df_idx)
z y x match
0 3 3 0 (3, 3)
1 5 4 1 (5, 4)
>>> print(df)
a b c d e match
0 5 4 A A A (5, 4)
1 3 3 A A A (3, 3)
>>> print(selectors['match'].tolist())
[(5, 4)]
>>> print(result)
a b c d e match
1 3 3 A A A (3, 3)
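As an aside, the same anti-join can be written without the helper tuple column, using merge with indicator=True (a sketch, not from the answers above):
import pandas as pd

df1 = pd.DataFrame({"z": [3, 5], "y": [3, 4], "x": [0, 1]})
df2 = pd.DataFrame({"a": [5, 3], "b": [4, 3], "c": ["p", "p"],
                    "d": ["p", "p"], "e": ["p", "p"]})

# Rows of df1 flagged with x == 1, renamed to df2's key columns.
keys = df1.loc[df1.x == 1, ["z", "y"]].rename(columns={"z": "a", "y": "b"})

# Left anti-join: keep df2 rows whose (a, b) pair is not in keys.
merged = df2.merge(keys, on=["a", "b"], how="left", indicator=True)
result = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(result)   # only the (3, 3) row remains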

Extract data from data frame based on two criteria

Given the following example.
d = {'col1': [1, 2], 'col2': [6, 7]}
df = pd.DataFrame(data=d)
df
col1 col2
0 1 6
1 2 7
newdf[df['col1' ==2]
newdf
col1 col2
0 2 7
Works just fine for single col
but
newdf[df['col1' ==2 & 'col2' == 7]
I win error prize.
There are two problems in your statement. The element-wise logical and operator for pandas Series is & (Python's and does not work on Series), and each condition must reference the DataFrame column and be wrapped in parentheses.
Your statement should be
>>> newdf = df[(df['col1'] == 2) & (df['col2'] == 7)]
Thanks #Trenton for the remark.
None of the following are correct
newdf[df['col1' ==2]
newdf[df['col1' ==2 & 'col2' == 7]
newdf[df['col1' == 2 && 'col2' == 7]
Parentheses must be around each condition:
Pandas: Boolean indexing
import pandas as pd
d = {'col1': [1, 2, 3, 2], 'col2': [6, 7, 8, 9]}
df = pd.DataFrame(data=d)
col1 col2
0 1 6
1 2 7
2 3 8
3 2 9
# specify multiple conditions
newdf = df[(df.col1 == 2) & (df.col2 == 7)]
print(newdf)
col1 col2
1 2 7
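If you prefer spelling the logic with and, DataFrame.query accepts it inside the expression string; a short sketch on the same data:
import pandas as pd

d = {'col1': [1, 2, 3, 2], 'col2': [6, 7, 8, 9]}
df = pd.DataFrame(data=d)

# Inside a query string, `and` is parsed by pandas rather than by Python.
newdf = df.query('col1 == 2 and col2 == 7')
print(newdf)   # row 1 only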

Add sequential counter to group within dataframe but skip increment when condition is met

I am looking to setup an incremental counter with a group in a dataframe. I would like to increase the counter for each row within the group unless a condition is met. If the condition is met I want to use the previous count. I also want this to reset for every group.
example:
d1 = {'col1': [1, 1, 1, 2, 2, 3], 'col2': ['A', 'A', 'B', 'A', 'A', 'B']}
df1 = pd.DataFrame(data=d1)
df1
output:
col1 col2
0 1 A
1 1 A
2 1 B
3 2 A
4 2 A
5 3 B
expected output:
col1 col2 count
0 1 A 1
1 1 A 2
2 1 B 2
3 2 A 1
4 2 A 2
5 3 B 0
I have tried using numpy's cumsum, but I am not really sure how to reuse the last count when the condition is met.
Edit:
Looking to group by col1.
I made a code snippet following what I believe is what you want; you can adapt it if something is not exactly as you expected. I think the key things here are:
1) iterate over the pairs of (previousRow, currentRow) so you can easily access the last row's information
2) write specific if-conditions that match what you expect
3) update the count inside the if-conditions and set the value afterwards
import pandas as pd
from itertools import zip_longest

d1 = {'col1': [1, 1, 1, 2, 2, 3], 'col2': ['A', 'A', 'B', 'A', 'A', 'B']}
df1 = pd.DataFrame(data=d1)
df1['count'] = 0

df1_previterrows = df1.iterrows()
df1_curriterrows = df1.iterrows()
df1_curriterrows.__next__()

groups_counter = {}

# Handle the first row separately.
df1_firstRow = df1.iloc[0]
if df1_firstRow["col2"] == "A":
    groups_counter[df1_firstRow['col1']] = 1
    df1.at[0, 'count'] = 1
elif df1_firstRow["col2"] == "B":
    groups_counter[df1_firstRow['col1']] = 0
    df1.at[0, 'count'] = 0

# Walk over (previousRow, currentRow) pairs.
zip_list = zip_longest(df1_previterrows, df1_curriterrows)
for (prevRow_idx, prevRow), Curr in zip_list:
    if Curr is not None:
        (currRow_idx, currRow) = Curr
        if currRow["col1"] == prevRow["col1"] and currRow["col2"] == "A":
            # Same group, 'A': increment (initialize first if unseen).
            if not groups_counter.get(currRow["col1"], False):
                groups_counter[currRow["col1"]] = 0
            groups_counter[currRow["col1"]] += 1
        elif currRow["col1"] != prevRow["col1"] and currRow["col2"] == "A":
            # New group starting with 'A': start counting at 1.
            groups_counter[currRow["col1"]] = 1
        elif currRow["col1"] == prevRow["col1"] and currRow["col2"] == "B":
            # Same group, 'B': reuse the previous count.
            if not groups_counter.get(currRow["col1"], False):
                groups_counter[currRow["col1"]] = 1
        elif currRow["col1"] != prevRow["col1"] and currRow["col2"] == "B":
            # New group starting with 'B': count stays 0.
            groups_counter[currRow["col1"]] = 0
        df1.at[currRow_idx, 'count'] = groups_counter[currRow["col1"]]
print(df1)
OUTPUT:
col1 col2 count
0 1 A 1
1 1 A 2
2 1 B 2
3 2 A 1
4 2 A 2
5 3 B 0
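For reference, the expected output is also reproduced by a much shorter vectorized approach, assuming the rule is "within each col1 group, count the cumulative number of 'A' rows" (a sketch, not the answer above):
import pandas as pd

d1 = {'col1': [1, 1, 1, 2, 2, 3], 'col2': ['A', 'A', 'B', 'A', 'A', 'B']}
df1 = pd.DataFrame(data=d1)

# An 'A' row increments the per-group counter; 'B' rows inherit the
# running total, which stays 0 if the group starts with 'B'.
df1['count'] = (df1['col2'] == 'A').astype(int).groupby(df1['col1']).cumsum()
print(df1)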

Finding elements in a pandas dataframe

I have a pandas dataframe which looks like the following:
0 1
0 2
2 3
1 4
What I want to do is the following: if I get 2 as input, my code is supposed to search for 2 in the dataframe, and when it finds it, return the value of the other column. In the above example my code would return 0 and 3. I know that I can simply look at each row and check whether any of the elements equals 2, but I was wondering if there is a one-liner for such a problem.
UPDATE: None of the columns are index columns.
Thanks
>>> df = pd.DataFrame({'A': [0, 0, 2, 1], 'B': [1,2,3,4]})
>>> df
A B
0 0 1
1 0 2
2 2 3
3 1 4
The following pandas syntax is equivalent to the SQL SELECT B FROM df WHERE A = 2
>>> df[df['A'] == 2]['B']
2 3
Name: B, dtype: int64
There's also pandas.DataFrame.query:
>>> df.query('A == 2')['B']
2 3
Name: B, dtype: int64
You may need this:
n_input = 2
df[(df == n_input).any(axis=1)].stack()[lambda x: x != n_input].unique()
# array([0, 3])
df = pd.DataFrame({'A': [0, 0, 2, 1], 'B': [1, 2, 3, 4]})
t = df.loc[lambda d: d['A'] == 2, 'B']
t
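If the input value can appear in either column, one way is a small helper that collects the companion values from the opposite column (a sketch; other_column_values is a made-up name):
import pandas as pd

df = pd.DataFrame({'A': [0, 0, 2, 1], 'B': [1, 2, 3, 4]})

def other_column_values(df, value):
    # Wherever `value` appears in A, take B, and vice versa.
    return pd.concat([df.loc[df['A'] == value, 'B'],
                      df.loc[df['B'] == value, 'A']]).unique()

print(other_column_values(df, 2))   # array([3, 0])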
