How to drop rows based on a column's values in Python?

I want to remove rows based on the values of the ID column.
df
ID B C D
0 101 1 2 3
1 103 5 6 7
2 108 9 10 11
3 109 5 3 12
4 118 11 15 2
5 121 2 5 6
Here remove_id is the list of ID values I want to remove:
remove_id = [103,108, 121]
I want the output to look like the following:
df
ID B C D
0 101 1 2 3
3 109 5 3 12
4 118 11 15 2
How can I do this?

You can check which IDs are in remove_id with the isin method, negate the result with ~ and use the resulting Series for boolean indexing.
>>> df[~df['ID'].isin(remove_id)]
ID B C D
0 101 1 2 3
3 109 5 3 12
4 118 11 15 2
Details:
>>> df['ID'].isin(remove_id)
0 False
1 True
2 True
3 False
4 False
5 True
Name: ID, dtype: bool
>>> ~df['ID'].isin(remove_id)
0 True
1 False
2 False
3 True
4 True
5 False
Name: ID, dtype: bool
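For reference, a minimal self-contained sketch of the approach, reconstructing the example frame from the question:
import pandas as pd

df = pd.DataFrame({'ID': [101, 103, 108, 109, 118, 121],
                   'B': [1, 5, 9, 5, 11, 2],
                   'C': [2, 6, 10, 3, 15, 5],
                   'D': [3, 7, 11, 12, 2, 6]})
remove_id = [103, 108, 121]

# keep only the rows whose ID is not in remove_id
df = df[~df['ID'].isin(remove_id)]
print(df)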

Related

How to perform a math function (mod) in Python only on cells with numeric data, skipping empty cells

How do I apply mod (% 4) only to cells with data and skip empty cells? Mod can't be applied to empty cells, so I want to skip them.
0 1 2 3 4 5 6 7 8 9 10
A 23 23 10 54 23
B 34 12 34 54
C 98 76
D 11 12 12
E 14 23
I tried the code below, but when a column contains an empty cell it doesn't perform any action on that column.
for i in split.iloc[:, 2:]:
    if not np.where(split[i] == ''):
        split[i] = split[i].astype(int)
        split[i] = split[i] % 4
    else:
        continue
---------------------------------------------------------------
actual output
0 1 2 3 4 5 6 7 8 9 10
A 3 3 10 54 23
B 2 0 34 54
C 2 0
D 3 0 12
E 2 3
---------------------------------------------------------------
expected output
0 1 2 3 4 5 6 7 8 9 10
A 3 3 2 2 3
B 2 0 2 2
C 2 0
D 3 0 0
E 2 3
Assign based on condition:
df = pd.DataFrame({'a': ['', 1, 2, 3, ''], 'b': [4, 7, 8, '', '']})
a b
0 4
1 1 7
2 2 8
3 3
4
df[lambda x: x != ''] = df[lambda x: x != ''] % 4
a b
0 0
1 1 3
2 2 0
3 3
4
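Applied to the question's layout, a minimal sketch along the same lines (assuming, as in the attempt above, that the empty cells hold empty strings; the two-column frame here is a made-up stand-in for the real data):
import pandas as pd

split = pd.DataFrame({'0': [23, 34, '', 11, ''],
                      '1': [23, 12, '', 12, 14]})

for col in split.columns:
    nums = pd.to_numeric(split[col], errors='coerce')  # '' becomes NaN
    # take mod 4 where a number exists, keep the blank cells untouched
    # (note: the numeric results come back as floats)
    split[col] = (nums % 4).where(nums.notna(), split[col])
print(split)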

Pandas - Remove rows if a specific value is repeated in a column and keep the first

Imagine we have a dataframe:
num line
0 1 56
1 1 90
2 2 66
3 3 4
4 3 55
5 3 104
6 1 23
7 5 22
8 3 144
I want to remove the rows where specifically a 3 is repeated consecutively in the num column, keeping the first. The two rows with repeating 1's in the num column should both remain in the resulting DataFrame, together with all the other rows.
What I have so far removes every consecutive duplicate, not only the 3's:
data.groupby((data['num'] != data['num'].shift()).cumsum().values).first()
Expected result or correct code:
num line
0 1 56
1 1 90
2 2 66
3 3 4
4 1 23
5 5 22
6 3 144
Use:
df = data[data['num'].ne(3) | data['num'].ne(data['num'].shift())]
print (df)
num line
0 1 56
1 1 90
2 2 66
3 3 4
6 1 23
7 5 22
8 3 144
Detail:
Compare for not equal:
print (data['num'].ne(3))
0 True
1 True
2 True
3 False
4 False
5 False
6 True
7 True
8 False
Name: num, dtype: bool
Compare by shifted values for first consecutive:
print (data['num'].ne(data['num'].shift()))
0 True
1 False
2 True
3 True
4 False
5 False
6 True
7 True
8 True
Name: num, dtype: bool
Chain by | for bitwise OR:
print (data['num'].ne(3) | data['num'].ne(data['num'].shift()))
0 True
1 True
2 True
3 True
4 False
5 False
6 True
7 True
8 True
Name: num, dtype: bool
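By De Morgan's law, the same filter can be written as one negated condition - drop the rows that are a 3 and equal the previous value:
df = data[~(data['num'].eq(3) & data['num'].eq(data['num'].shift()))]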
You could use the conditions below to perform boolean indexing on the dataframe:
# True where num is 3
c1 = df['num'].eq(3)
# True where num is repeated
c2 = df['num'].eq(df['num'].shift(1))
# boolean indexation on df
df[(c1 & ~c2) | ~(c1)]
num line
0 1 56
1 1 90
2 2 66
3 3 4
6 1 23
7 5 22
8 3 144
Details
df.assign(is_3=c1, is_repeated=c2, filtered=(c1 & ~c2) | ~(c1))
num line is_3 is_repeated filtered
0 1 56 False False True
1 1 90 False True True
2 2 66 False False True
3 3 4 True False True
4 3 55 True True False
5 3 104 True True False
6 1 23 False False True
7 5 22 False False True
8 3 144 True False True
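Note that (c1 & ~c2) | ~c1 simplifies to ~(c1 & c2), so this is the same condition as in the first answer (ne is just the negation of eq):
df[~(c1 & c2)]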

Compare DataFrame column with numpy ndarray and update value in DataFrame

DataFrame df:
 A   B   C   D  E
 1   2   4   6     # value to be updated for this column
12  34   5  54
 4   8  12   4
 3   5   6   2
 5   7  11  27
numpy ndarray of shape (4, 1):
npar = np.array([[12],
                 [6],
                 [2],
                 [27]])
I have the above dataframe df and the array npar. I want to check each value of column D against npar: if the value is found anywhere in npar, I want to set column E to 1 for that row of df, else 0. Kindly suggest how I can do this with sample code.
You need isin, but it is first necessary to flatten the array with numpy.ravel, and finally to convert the boolean mask to integers - Trues become 1s and Falses become 0s:
df['E'] = df.D.isin(npar.ravel()).astype(int)
print (df)
A B C D E
0 1 2 4 6 1
1 12 34 5 54 0
2 4 8 12 4 0
3 3 5 6 2 1
4 5 7 11 27 1
Detail:
npar = np.array([[12],[6],[2],[27]])
print (npar)
[[12]
[ 6]
[ 2]
[27]]
print (npar.ravel())
[12 6 2 27]
print (df.D.isin(npar.ravel()))
0 True
1 False
2 False
3 True
4 True
Name: D, dtype: bool
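For completeness, an equivalent sketch with numpy.where, which makes the 1/0 mapping explicit:
import numpy as np
df['E'] = np.where(df['D'].isin(npar.ravel()), 1, 0)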

Assigning values by a boolean vector gives a confusing result

Here is the dataframe:
import pandas as pd
df = pd.DataFrame({'a':[2,5,6,1,2,7,2,0,2,4]})
I want to replace the elements that are larger than 3, so I do this:
df['a'][df['a']>3] = [9,99,999,9999]
I got the wrong result:
a
0 2
1 99
2 999
3 1
4 2
5 99
6 2
7 0
8 2
9 99
The result that I want is this:
a
0 2
1 9
2 99
3 1
4 2
5 999
6 2
7 0
8 2
9 9999
However, when the replacement is strings, the result seems right:
df['a'][df['a']>3] = ['a','b','c','d']
a
0 2
1 a
2 b
3 1
4 2
5 c
6 2
7 0
8 2
9 d
Why does this happen, and how can I make it right? Thanks!
Don't use chained indexing. Try this:
df.loc[df['a']>3, 'a'] = [9,99,999,9999]
# a
# 0 2
# 1 9
# 2 99
# 3 1
# 4 2
# 5 999
# 6 2
# 7 0
# 8 2
# 9 9999
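As for why the numeric assignment misbehaves: the output matches numpy.putmask semantics, which is a plausible explanation for the pandas version in the question. When the replacement array is shorter than the target, putmask repeats it cyclically, indexed by absolute position in the array rather than by position among the masked elements:
import numpy as np

arr = np.array([2, 5, 6, 1, 2, 7, 2, 0, 2, 4])
# masked positions are 1, 2, 5 and 9, so they receive
# values[1], values[2], values[5 % 4] and values[9 % 4]
np.putmask(arr, arr > 3, [9, 99, 999, 9999])
print(arr)  # [  2  99 999   1   2  99   2   0   2  99]
This reproduces the question's surprising output exactly. With strings the dtype has to change, so a different assignment path is taken and the values land in order.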

Combining two datasets to form a boolean column (pandas)

I have two DataFrames in pandas:
dfm_one
data group_a group_b
0 3 a z
1 1 a z
2 2 b x
3 0 b x
4 0 b x
5 1 b z
6 0 c x
7 0 c y
8 3 c z
9 3 c z
dfm_two
data group_a group_b
0 4 a x
1 4 a y
2 4 b x
3 4 b x
4 4 b y
5 1 b y
6 1 b z
7 1 c x
8 4 c y
9 3 c z
10 2 c z
As output I want a boolean column that indicates, for each row of dfm_one, whether there is a matching data entry (i.e. one with the same value) in dfm_two for that group_a/group_b combination.
So my expected output is:
0 False
1 False
2 False
3 False
4 False
5 True
6 False
7 False
8 True
9 True
I'm guessing the code should look something like:
dfm_one.groupby(['group_a','group_b']).apply(lambda x: ??)
and that the function inside apply should make use of the isin method.
Another solution might be to merge the two datasets but I think this is not trivial since there is no unique identifier in the DataFrame.
OK, this is a slight hack: if we cast the df to str dtype, we can call sum to concatenate each row into a string, use the resulting string as a kind of unique identifier, and then call isin on the other df, again converted to str:
In [91]:
dfm_one.astype(str).sum(axis=1).isin(dfm_two.astype(str).sum(axis=1))
Out[91]:
0 False
1 False
2 False
3 False
4 False
5 True
6 False
7 False
8 True
9 True
dtype: bool
Output from the conversions:
In [92]:
dfm_one.astype(str).sum(axis=1)
Out[92]:
0 3az
1 1az
2 2bx
3 0bx
4 0bx
5 1bz
6 0cx
7 0cy
8 3cz
9 3cz
dtype: object
In [93]:
dfm_two.astype(str).sum(axis=1)
Out[93]:
0 4ax
1 4ay
2 4bx
3 4bx
4 4by
5 1by
6 1bz
7 1cx
8 4cy
9 3cz
10 2cz
dtype: object
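An alternative sketch that avoids the string hack (concatenated strings can collide once values grow past one character) is a left merge with indicator=True, assuming a match means all three columns are equal; drop_duplicates keeps the merge to one row per row of dfm_one:
merged = dfm_one.merge(dfm_two.drop_duplicates(),
                       on=['data', 'group_a', 'group_b'],
                       how='left', indicator=True)
print(merged['_merge'].eq('both'))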
