Count occurrences of row values in a given set with pandas

I have a dataframe similar to
    a   b   c   d   e
0  36  38  27  12  35
1  45  33   8  41  18
4  32  14   4  14   9
5  43   1  31  11   3
6  16   8   3  17  39
...
and I want, for each row, to count the occurrences of values in a given set.
I came up with the following code (Python 3), which seems to work, but I'm looking for efficiency, since my real dataframe is much bigger and more complex:
import pandas as pd
import numpy as np

def column():
    return [np.random.randint(0, 49) for _ in range(20)]

df = pd.DataFrame({'a': column(), 'b': column(), 'c': column(), 'd': column(), 'e': column()})
given_set = {3, 8, 11, 18, 22, 24, 35, 36, 42, 47}

def count_occurrences(row):
    return sum(col in given_set for col in (row.a, row.b, row.c, row.d, row.e))

df['count'] = df.apply(count_occurrences, axis=1)
print(df)
Is there a way to obtain the same result with vectorized pandas operations, instead of a Python function applied row by row?
Thanks in advance.

IIUC, you can use the DataFrame.isin() method:
Data:
In [41]: given_set = {3,8,11,18,22,24,35,36,42,47}
In [42]: df
Out[42]:
    a   b   c   d   e
0  36  38  27  12  35
1  45  33   8  41  18
4  32  14   4  14   9
5  43   1  31  11   3
6  16   8   3  17  39
Solution:
In [44]: df['new'] = df.isin(given_set).sum(axis=1)
In [45]: df
Out[45]:
    a   b   c   d   e  new
0  36  38  27  12  35    2
1  45  33   8  41  18    2
4  32  14   4  14   9    0
5  43   1  31  11   3    2
6  16   8   3  17  39    2
Explanation:
In [49]: df.isin(given_set)
Out[49]:
       a      b      c      d      e
0   True  False  False  False   True
1  False  False   True  False   True
4  False  False  False  False  False
5  False  False  False   True   True
6  False   True   True  False  False
In [50]: df.isin(given_set).sum(axis=1)
Out[50]:
0    2
1    2
4    0
5    2
6    2
dtype: int64
UPDATE: if you want to check for existence instead of counting, you can do it this way (thanks to @DSM):
In [6]: df.isin(given_set).any(axis=1)
Out[6]:
0     True
1     True
4    False
5     True
6     True
dtype: bool
In [7]: df.isin(given_set).any(axis=1).astype(np.uint8)
Out[7]:
0    1
1    1
4    0
5    1
6    1
dtype: uint8
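For very wide or long frames it can be worth benchmarking a NumPy-level variant of the same idea. A minimal sketch, assuming all columns are numeric so the frame converts cleanly to a single array:

import numpy as np

vals = np.array(sorted(given_set))  # np.isin wants an array-like, not a set
df['count'] = np.isin(df.to_numpy(), vals).sum(axis=1)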


df.apply() but skip the first row

I am trying to apply the following df.apply command to a dataframe but want it to skip the first row. Any advice on how to do that without setting the first row as the column headers?
res = sheet1[sheet1.apply(lambda row: row.astype(str).str.contains('TRUE', case=False).any(), axis=1)]
You can select from index 1 onwards as follows:
res = sheet1[1:].apply(lambda row: row.astype(str).str.contains('TRUE', case=False).any(), axis=1)
EDIT Version 3:
import pandas as pd
import random

df = pd.DataFrame({'a': range(1, 11), 'b': range(2, 21, 2), 'c': range(1, 20, 2),
                   'd': ['TRUE' if random.randint(0, 1) else 'FALSE' for _ in range(10)]})
print(df)
res = df[df.apply(lambda row: row.astype(str).str.contains('TRUE', case=False).any(), axis=1)]
print(res.loc[1:])
If all you want is the rows from index 1 onwards, you can do it as shown above:
The input DataFrame is:
    a   b   c      d
0   1   2   1   TRUE
1   2   4   3  FALSE
2   3   6   5  FALSE
3   4   8   7   TRUE
4   5  10   9   TRUE
5   6  12  11   TRUE
6   7  14  13  FALSE
7   8  16  15  FALSE
8   9  18  17   TRUE
9  10  20  19   TRUE
The output of res will be:
    a   b   c     d
0   1   2   1  TRUE
3   4   8   7  TRUE
4   5  10   9  TRUE
5   6  12  11  TRUE
8   9  18  17  TRUE
9  10  20  19  TRUE
The output of res.loc[1:], excluding the first row, will be:
    a   b   c     d
3   4   8   7  TRUE
4   5  10   9  TRUE
5   6  12  11  TRUE
8   9  18  17  TRUE
9  10  20  19  TRUE
EDIT Version 2:
Here's an example with 'TRUE' and 'FALSE' in the column.
import pandas as pd
import random

df = pd.DataFrame({'a': ['TRUE' if random.randint(0, 1) else 'FALSE' for _ in range(10)]})
print(df)
res = df.iloc[1:].apply(lambda row: row.astype(str).str.contains('TRUE', case=False).any(), axis=1)
print(res)
The output will be:
Original DataFrame:
       a
0   TRUE
1   TRUE
2  FALSE
3  FALSE
4  FALSE
5   TRUE
6   TRUE
7  FALSE
8  FALSE
9  FALSE
Result from the DataFrame:
1     True
2    False
3    False
4    False
5     True
6     True
7    False
8    False
9    False
dtype: bool
You can also use loc instead of iloc (loc slices by label and includes the endpoint, while iloc slices by position; with the default RangeIndex they behave the same here):
res = df.loc[1:].apply(lambda row: row.astype(str).str.contains('TRUE', case=False).any(), axis=1)
As you can see, it skipped the first row.
Old answer
Here's an example:
import pandas as pd
df = pd.DataFrame({'a':range(1,11), 'b':range(2,21,2), 'c':range(1,20,2)})
print (df)
res = df.iloc[1:, :].apply(lambda x: x + 10, axis=1)
print (res)
Original DataFrame:
    a   b   c
0   1   2   1
1   2   4   3
2   3   6   5
3   4   8   7
4   5  10   9
5   6  12  11
6   7  14  13
7   8  16  15
8   9  18  17
9  10  20  19
Only rows 1 onwards got modified:
    a   b   c
1  12  14  13
2  13  16  15
3  14  18  17
4  15  20  19
5  16  22  21
6  17  24  23
7  18  26  25
8  19  28  27
9  20  30  29
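If the first row should be kept unchanged in the result rather than dropped, a minimal sketch (my own variation, not part of the answer above) is to transform only the tail and concatenate the untouched head back on:

import pandas as pd

df = pd.DataFrame({'a': range(1, 11), 'b': range(2, 21, 2), 'c': range(1, 20, 2)})
tail = df.iloc[1:].apply(lambda x: x + 10, axis=1)  # transform rows 1 onwards
res = pd.concat([df.iloc[:1], tail])                # reattach row 0 unchanged
print(res)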

Pandas sum of variable number of columns

I have a pandas dataframe like this -
Time  1 A  2 A  3 A  4 A  5 A  6 A  100 A
   5   10    4    6    6    4    6      4
   3    7   19    2    7    7    9     18
   6    3    6    3    3    8   10     56
   2    5    9    1    1    9   12     13
The Time column gives me the number of A columns that I need to sum up, so that the output looks like this:
Time  1 A  2 A  3 A  4 A  5 A  6 A  100 A  Total
   5   10    4    6    6    4    6      4     30
   3    7   19    2    7    7    9     18     28
   6    3    6    3    3    8   10     56     33
   2    5    9    1    1    9   12     13     14
In other words, when the value in the Time column is 3, it should sum up 1 A, 2 A and 3 A;
when the value in the Time column is 5, it should sum up 1 A, 2 A, 3 A, 4 A and 5 A.
Note: there are other columns in between the A columns, so I can't sum using simple slicing.
Highly appreciate any help in finding a solution.
Use NumPy. The idea: compare an array created by np.arange over the number of columns against the Time column (moved into the index), broadcasting to a 2D mask; then pick the matched values with numpy.where and sum each row:
df1 = df.set_index('Time')
m = np.arange(len(df1.columns)) < df1.index.values[:, None]
df['new'] = np.where(m, df1.values, 0).sum(axis=1)
print(df)
   Time  1 A  2 A  3 A  4 A  5 A  6 A  100 A  new
0     5   10    4    6    6    4    6      4   30
1     3    7   19    2    7    7    9     18   28
2     6    3    6    3    3    8   10     56   33
3     2    5    9    1    1    9   12     13   14
Details:
print(df1)
      1 A  2 A  3 A  4 A  5 A  6 A  100 A
Time
5      10    4    6    6    4    6      4
3       7   19    2    7    7    9     18
6       3    6    3    3    8   10     56
2       5    9    1    1    9   12     13
print(m)
[[ True  True  True  True  True False False]
 [ True  True  True False False False False]
 [ True  True  True  True  True  True False]
 [ True  True False False False False False]]
print(np.where(m, df1.values, 0))
[[10  4  6  6  4  0  0]
 [ 7 19  2  0  0  0  0]
 [ 3  6  3  3  8 10  0]
 [ 5  9  0  0  0  0  0]]
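Since the question notes there can be unrelated columns in between the A columns, here is a minimal self-contained sketch of the same masking idea that first selects the relevant columns by name. The endswith('A') selector is an assumption for illustration; pick whatever rule identifies your A columns:

import numpy as np
import pandas as pd

# Rebuild the question's frame, plus a hypothetical unrelated column 'X'.
df = pd.DataFrame({'Time': [5, 3, 6, 2],
                   '1 A': [10, 7, 3, 5], 'X': [0, 0, 0, 0],
                   '2 A': [4, 19, 6, 9], '3 A': [6, 2, 3, 1],
                   '4 A': [6, 7, 3, 1], '5 A': [4, 7, 8, 9],
                   '6 A': [6, 9, 10, 12], '100 A': [4, 18, 56, 13]})

a_cols = [c for c in df.columns if c.endswith('A')]  # assumed naming convention
mask = np.arange(len(a_cols)) < df['Time'].to_numpy()[:, None]
df['Total'] = np.where(mask, df[a_cols].to_numpy(), 0).sum(axis=1)
print(df['Total'].tolist())  # [30, 28, 33, 14]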
An apply-based alternative (compact, but slower; note that plain integer lookup like x[i+1] on a labeled row Series is deprecated in recent pandas, so use .iloc for positional access):
df['total'] = df.apply(lambda x: sum(x.iloc[i + 1] for i in range(x['Time'])), axis=1)

Check whether multiple columns are in the range of the remaining two columns

I have a pandas dataframe like
dd1 =
    A   B   C   D   E   F
   10  18  13  11   9  25
    0  32  27   3  18  28
    4   6   3  29   2  23
and I want to check whether columns A to D are within the range given by columns E and F.
Desired output: 0 in the Result column when every value is in range, otherwise the value(s) that fall out of range:
dd1 =
    A   B   C   D   E   F  Result
   10  18  13  11   9  25  0
    0  32  27   3  18  28  [0, 32]
    4   6   3  29   2  23  29
I tried the following, which fails (DataFrame has no between method, only Series does):
dd1=dd1.loc[dd1.iloc[:,0:3].between(dd1['E'],dd6['F'])]
Use:
df1 = df.iloc[:, :4]            # columns A..D
m1 = df1.gt(df['E'], axis=0)    # above the lower bound
m2 = df1.lt(df['F'], axis=0)    # below the upper bound
df['Result'] = df1.mask(m1 & m2).apply(lambda x: x.dropna().astype(int).tolist(), axis=1)
print(df)
    A   B   C   D   E   F      Result
0  10  18  13  11   9  25          []
1   0  32  27   3  18  28  [0, 32, 3]
2   4   6   3  29   2  23        [29]
If you need 0 where everything is in range:
s = df1.mask(m1 & m2).apply(lambda x: x.dropna().astype(int).tolist(), axis=1)
df['Result'] = np.where(s.astype(bool), s, 0)
print(df)
    A   B   C   D   E   F      Result
0  10  18  13  11   9  25           0
1   0  32  27   3  18  28  [0, 32, 3]
2   4   6   3  29   2  23        [29]
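For reference, a self-contained version of the approach above that reproduces the expected output. Note the bounds are exclusive (gt/lt); swap in ge/le if the range should include the endpoints:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [10, 0, 4], 'B': [18, 32, 6], 'C': [13, 27, 3],
                   'D': [11, 3, 29], 'E': [9, 18, 2], 'F': [25, 28, 23]})

df1 = df.iloc[:, :4]
in_range = df1.gt(df['E'], axis=0) & df1.lt(df['F'], axis=0)
s = df1.mask(in_range).apply(lambda x: x.dropna().astype(int).tolist(), axis=1)
df['Result'] = np.where(s.astype(bool), s, 0)  # empty lists become 0
print(df)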

Pandas - Remove row if a specific value is repeated in a column and keep the first

Imagine we have a dataframe:
   num  line
0    1    56
1    1    90
2    2    66
3    3     4
4    3    55
5    3   104
6    1    23
7    5    22
8    3   144
I want to remove the rows where specifically a 3 is repeated in the num column, and keep the first. So the two rows with repeating 1's in the num column should still be in the resulting DataFrame together with all the other columns.
What I have so far, which removes every double value, not only the 3's:
data.groupby((data['num'] != data['num'].shift()).cumsum().values).first()
Expected result:
   num  line
0    1    56
1    1    90
2    2    66
3    3     4
4    1    23
5    5    22
6    3   144
Use:
df = data[data['num'].ne(3) | data['num'].ne(data['num'].shift())]
print(df)
   num  line
0    1    56
1    1    90
2    2    66
3    3     4
6    1    23
7    5    22
8    3   144
Detail:
Compare for not equal:
print(data['num'].ne(3))
0     True
1     True
2     True
3    False
4    False
5    False
6     True
7     True
8    False
Name: num, dtype: bool
Compare by shifted values for first consecutive:
print(data['num'].ne(data['num'].shift()))
0     True
1    False
2     True
3     True
4    False
5    False
6     True
7     True
8     True
Name: num, dtype: bool
Chain by | for bitwise OR:
print(data['num'].ne(3) | data['num'].ne(data['num'].shift()))
0     True
1     True
2     True
3     True
4    False
5    False
6     True
7     True
8     True
Name: num, dtype: bool
You could use the conditions below to perform boolean indexing on the dataframe:
# True where num is 3
c1 = df['num'].eq(3)
# True where num is repeated
c2 = df['num'].eq(df['num'].shift(1))
# boolean indexation on df
df[(c1 & ~c2) | ~(c1)]
   num  line
0    1    56
1    1    90
2    2    66
3    3     4
6    1    23
7    5    22
8    3   144
Details
df.assign(is_3=c1, is_repeated=c2, filtered=(c1 & ~c2) | ~(c1))

   num  line   is_3  is_repeated  filtered
0    1    56  False        False      True
1    1    90  False         True      True
2    2    66  False        False      True
3    3     4   True        False      True
4    3    55   True         True     False
5    3   104   True         True     False
6    1    23  False        False      True
7    5    22  False        False      True
8    3   144   True        False      True
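Both answers hard-code the value 3. A minimal generalization (my own sketch, with a hypothetical targets set) de-duplicates consecutive runs of any value in the set:

import pandas as pd

data = pd.DataFrame({'num': [1, 1, 2, 3, 3, 3, 1, 5, 3],
                     'line': [56, 90, 66, 4, 55, 104, 23, 22, 144]})

targets = {3}  # hypothetical: add more values to de-duplicate their runs too
mask = ~data['num'].isin(targets) | data['num'].ne(data['num'].shift())
print(data[mask])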

How to drop rows based on a column's values in Python?

I want to remove rows based on the values in the ID column.
df
    ID   B   C   D
0  101   1   2   3
1  103   5   6   7
2  108   9  10  11
3  109   5   3  12
4  118  11  15   2
5  121   2   5   6
Here remove_id is the list of ID values I want to remove:
remove_id = [103,108, 121]
I want output like the following:
df
    ID   B   C   D
0  101   1   2   3
3  109   5   3  12
4  118  11  15   2
How can I do this?
You can check which IDs are in remove_id with the isin method, negate the result with ~ and use the resulting Series for boolean indexing.
>>> df[~df['ID'].isin(remove_id)]
    ID   B   C   D
0  101   1   2   3
3  109   5   3  12
4  118  11  15   2
Details:
>>> df['ID'].isin(remove_id)
0    False
1     True
2     True
3    False
4    False
5     True
Name: ID, dtype: bool
>>> ~df['ID'].isin(remove_id)
0     True
1    False
2    False
3     True
4     True
5    False
Name: ID, dtype: bool
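A self-contained sketch of the same isin/~ pattern, for quick verification:

import pandas as pd

df = pd.DataFrame({'ID': [101, 103, 108, 109, 118, 121],
                   'B': [1, 5, 9, 5, 11, 2],
                   'C': [2, 6, 10, 3, 15, 5],
                   'D': [3, 7, 11, 12, 2, 6]})
remove_id = [103, 108, 121]

print(df[~df['ID'].isin(remove_id)])  # keeps IDs 101, 109, 118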
