Looping with an if statement over a DataFrame - Python

I'm running into an issue when iterating over rows in a pandas DataFrame. This is the code I am trying to run:
import pandas as pd

data = {'test':  [1, 1, 0, 0, 3, 1, 0, 3, 0],
        'test2': [0, 2, 0, 1, 1, 2, 7, 3, 2]}
df = pd.DataFrame(data)
df['combined'] = df['test'] + df['test2']
df['combined'].astype('float64')
df

for index, row in df.iterrows():
    if row['test'] >= 1 & row['test2'] >= 1:
        row['combined'] /= 2
    else:
        pass
It should divide combined by 2 when both test and test2 have a value of 1 or more; however, it doesn't divide all the rows that should be divided. Am I making a mistake somewhere?
This is the outcome when I run the code (the columns are test, test2, and combined):

   test  test2  combined
0     1      0         1
1     1      2         3
2     0      0         0
3     0      1         1
4     3      1         2
5     1      2         3
6     0      7         7
7     3      3         3
8     0      2         2

You are using &, the bitwise AND operator; you should be using and, the boolean AND operator. Because & binds more tightly than >=, the condition is actually parsed as row['test'] >= (1 & row['test2']) >= 1, which is why the if statement gives an answer you don't expect.
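A quick check with plain scalars, using the values from row 1 of the example data:

test, test2 = 1, 2
# & binds tighter than >=, so the first print is parsed as
# test >= (1 & test2) >= 1, and 1 & 2 == 0
print(test >= 1 & test2 >= 1)    # False, not the intended check
print(test >= 1 and test2 >= 1)  # True, what was actually meant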

Beyond that, iterating over the rows is in general bad practice and should be avoided for performance reasons unless it is strictly necessary. The better solution is to define a mask with your conditions and operate within the mask using .loc:
import pandas as pd

data = {'test':  [1, 1, 0, 0, 3, 1, 0, 3, 0],
        'test2': [0, 2, 0, 1, 1, 2, 7, 3, 2]}
df = pd.DataFrame(data)
df['combined'] = df['test'] + df['test2']
df['combined'] = df['combined'].astype('float64')  # astype returns a new Series, so assign it back
mask = (df['test'] >= 1) & (df['test2'] >= 1)
df.loc[mask, 'combined'] /= 2
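For reference, df should then look like this (a quick check, given the astype assignment above):

   test  test2  combined
0     1      0       1.0
1     1      2       1.5
2     0      0       0.0
3     0      1       1.0
4     3      1       2.0
5     1      2       1.5
6     0      7       7.0
7     3      3       3.0
8     0      2       2.0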

Related

How to remove rows with multiple occurrences in a row with pandas

I have this data:
   A
1  1
2  1
3  1
4  2
5  2
6  1
I expect to get:

   A
1  1
-  -  -> (drop)
3  1
4  2
5  2
6  1
I want to drop all the rows in column 'A' that repeat the value of the previous row, but keep the first and the last row of each run. Until now I used:

df = df.loc[df[col].shift() != df[col]]

but it also removes the last appearance of each run. Sorry for my bad English; thanks in advance.
Looks like you have the same problem as this question: Pandas drop_duplicates. Keep first AND last. Is it possible?.
The suggested solution is:
pd.concat([
    df['A'].drop_duplicates(keep='first'),
    df['A'].drop_duplicates(keep='last'),
])
Update after clarification:
First get the boolean masks for your described criteria:
is_last = df['A'] != df['A'].shift(-1)
is_duplicate = df['A'] == df['A'].shift()
And drop the rows based on these:
df.drop(df.index[~is_last & is_duplicate]) # note the ~ to negate is_last
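To see why this works, here are the intermediate masks for the example data (a quick illustration; for the last row shift(-1) yields NaN, so is_last is True there):

   A  is_last  is_duplicate  ~is_last & is_duplicate
1  1    False         False                    False
2  1    False          True                     True
3  1     True          True                    False
4  2    False         False                    False
5  2     True          True                    False
6  1     True         False                    False

Only row 2 is flagged, so it is the only one dropped.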
Basically you need to group consecutive values, which can be achieved with diff and cumsum:
print(df.groupby(df["A"].diff().ne(0).cumsum(), as_index=False).nth([0, -1]))
   A
1  1
3  1
4  2
5  2
6  1
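The inner expression labels each run of consecutive equal values with its own group id (a quick check; diff is NaN for the first row, and NaN != 0 counts as True, so the first row opens group 1):

print(df["A"].diff().ne(0).cumsum())
1    1
2    1
3    1
4    2
5    2
6    3
Name: A, dtype: int64

nth([0, -1]) then keeps the first and the last row of every group.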

How to drop conflicting rows in a DataFrame?

I have a classification task where conflicts, i.e. the same feature but a different label, harm performance.
idx  feature  label
0    a        0
1    a        1
2    b        0
3    c        1
4    a        0
5    b        0
How could I get the formatted DataFrame below?

idx  feature  label
2    b        0
3    c        1
5    b        0

DataFrame.duplicated() only flags the duplicated rows, and it seems that logical operations between df["feature"].duplicated() and df.duplicated() do not return the results I want.
I think you need the rows with only one unique value per group - so use GroupBy.transform with DataFrameGroupBy.nunique, compare with 1, and filter with boolean indexing:
df = df[df.groupby('feature')['label'].transform('nunique').eq(1)]
print(df)
   idx feature  label
2    2       b      0
3    3       c      1
5    5       b      0
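To illustrate, on the original df the transform broadcasts each group's number of distinct labels back onto every row (a quick check: feature a carries labels {0, 1}, so all of its rows get 2):

print(df.groupby('feature')['label'].transform('nunique'))
0    2
1    2
2    1
3    1
4    2
5    1
Name: label, dtype: int64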

Merge Pandas Dataframe based on boolean function

I am looking for an efficient way to merge two pandas DataFrames based on a function that takes columns from both frames as input and returns True or False. E.g., assume I have the following "tables":
import pandas as pd

df_1 = pd.DataFrame(data=[1, 2, 3])
df_2 = pd.DataFrame(data=[4, 5, 6])

def validation(a, b):
    return ((a + b) % 2) == 0
I would like to join df_1 and df_2 on every pair of rows where the sum of the first columns is an even number. The resulting table would be

       1 5
df_3 = 2 4
       2 6
       3 5

Please think of it as a general problem, not as a task to return just df_3. The solution should accept any function that validates a combination of columns and returns True or False.
Thanks, Lazloo
You can do it with a merge on a parity helper column:

(df_1.assign(parity=df_1[0] % 2)
     .merge(df_2.assign(parity=df_2[0] % 2), on='parity')
     .drop('parity', axis=1)
)
output:

   0_x  0_y
0    1    5
1    3    5
2    2    4
3    2    6
You can use broadcasting, or the outer functions, to compare all rows. You'll run into issues as the length becomes large.
import pandas as pd
import numpy as np

def validation(a, b):
    """a, b : np.array"""
    arr = np.add.outer(a, b)       # how to combine rows
    i, j = np.where(arr % 2 == 0)  # condition
    return pd.DataFrame(np.stack([a[i], b[j]], axis=1))

validation(df_1[0].to_numpy(), df_2[0].to_numpy())
   0  1
0  1  5
1  2  4
2  2  6
3  3  5
In this particular case you might leverage the fact that the sum of two numbers is even exactly when the numbers have the same parity, so define a parity column on each frame and merge on it.
df_1['parity'] = df_1[0] % 2
df_2['parity'] = df_2[0] % 2
df_3 = df_1.merge(df_2, on='parity')

   0_x  parity  0_y
0    1       1    5
1    3       1    5
2    2       0    4
3    2       0    6
This is a basic solution, but not very efficient if you are working with large DataFrames:

df_1.index *= 0  # collapse both indexes to 0 so the join produces a cross join
df_2.index *= 0
df = df_1.join(df_2, lsuffix='_2')
df = df[df.sum(axis=1) % 2 == 0]
Edit: here is a better solution

df_1.index = df_1.iloc[:, 0] % 2  # index both frames by parity
df_2.index = df_2.iloc[:, 0] % 2
df = df_1.join(df_2, lsuffix='_2')  # the join matches rows of equal parity
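For the fully general version the question asks for, one option (a minimal sketch, assuming pandas >= 1.2 for how='cross') is to build the full cross product and then filter it with the validation function:

import pandas as pd

df_1 = pd.DataFrame(data=[1, 2, 3])
df_2 = pd.DataFrame(data=[4, 5, 6])

def validation(a, b):
    return ((a + b) % 2) == 0

# Cross join: pair every row of df_1 with every row of df_2; merge suffixes
# the overlapping column 0 to '0_x' and '0_y'.
pairs = df_1.merge(df_2, how='cross')
df_3 = pairs[validation(pairs['0_x'], pairs['0_y'])]

Like the broadcasting approach, this materializes all len(df_1) * len(df_2) pairs, so it trades memory for complete generality.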

Pandas Sum & Count Across Only Certain Columns

I have just started learning pandas, and this is a very basic question. Believe me, I have searched for an answer, but can't find one.
Can you please run this python code?
import pandas as pd

df = pd.DataFrame({'A': [1, 0], 'B': [2, 4], 'C': [4, 4], 'D': [1, 4],
                   'count__4s_abc': [1, 2], 'sum__abc': [7, 8]})
df
How do I create column 'count__4s_abc' in which I want to count how many times the number 4 appears in just columns A-C? (While ignoring column D.)
How do I create column 'sum__abc' in which I want to sum the amounts in just columns A-C? (While ignoring column D.)
Thanks much for any help!
Using drop (note that this df already contains the pre-filled result columns, so they have to be dropped too, or they would be aggregated as well):

df.assign(
    count__4s_abc=df.drop(columns=['D', 'count__4s_abc', 'sum__abc']).eq(4).sum(axis=1),
    sum__abc=df.drop(columns=['D', 'count__4s_abc', 'sum__abc']).sum(axis=1)
)

Or explicitly choosing the 3 columns:

df.assign(
    count__4s_abc=df[['A', 'B', 'C']].eq(4).sum(axis=1),
    sum__abc=df[['A', 'B', 'C']].sum(axis=1)
)

Or using iloc to get the first 3 columns:

df.assign(
    count__4s_abc=df.iloc[:, :3].eq(4).sum(axis=1),
    sum__abc=df.iloc[:, :3].sum(axis=1)
)
All give

   A  B  C  D  count__4s_abc  sum__abc
0  1  2  4  1              1         7
1  0  4  4  4              2         8
One additional option:
In [158]: formulas = """
     ...: new_count__4s_abc = (A==4)*1 + (B==4)*1 + (C==4)*1
     ...: new_sum__abc = A + B + C
     ...: """

In [159]: df.eval(formulas)
Out[159]:
   A  B  C  D  count__4s_abc  sum__abc  new_count__4s_abc  new_sum__abc
0  1  2  4  1              1         7                  1             7
1  0  4  4  4              2         8                  2             8
The DataFrame.eval() method can be (but is not always) faster than regular pandas arithmetic.
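Note that with the default inplace=False, df.eval(formulas) returns a new DataFrame and leaves df unchanged, so assign the result back to keep the computed columns:

df = df.eval(formulas)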

Pandas: Combination of two data frames

I have two data frames, old and new. Both have identical columns.
I want to, by the index:

1. Add rows to old that exist in new but not in old.
2. Update rows of old with the data from new.

Is there an efficient way of doing this in pandas? I found update(), which does exactly the second step; however, it doesn't add rows. I could do the first step using some set logic on the indices, but that does not appear efficient. What's the best way to do these two operations?
Example:

old:
   a  b
0  1  1
1  3  3

new:
   a  b
1  1  2
2  1  2

result:
   a  b
0  1  1
1  1  2
2  1  2
You could first find the indices common to both DataFrames, assign the second frame's values at those indices in the first, and then pick up the remaining rows with combine_first:
In [35]: df1
Out[35]:
a b
0 1 1
1 3 3
In [36]: df2
Out[36]:
a b
1 1 2
2 1 2
idx = df1.index.intersection(df2.index)  # Index.intersection; `&` on indexes is deprecated
df1.loc[idx, :] = df2.loc[idx, :]
df1 = df1.combine_first(df2)
In [39]: df1
Out[39]:
a b
0 1 1
1 1 2
2 1 2
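Since new's rows should win wherever the indexes overlap, the same result can also be obtained in one step (a quick note, assuming new contains no NaN values that should defer to old):

df2.combine_first(df1)  # df2 takes priority; df1 only fills in what df2 is missing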
You can do the first step using the df.reindex() method, reindexing old to the union of both indexes (reindexing to new.index alone would drop the rows that exist only in old):

old = old.reindex(index=old.index.union(new.index))

and then old.update(new) takes care of the second step.
