For this dataframe
df
basketID productCode
0 1 23
1 1 24
2 1 25
3 2 23
4 3 23
5 4 25
6 5 24
7 5 25
A single comparison gives the expected result:
(df['productCode']) == 23
0 True
1 False
2 False
3 True
4 True
5 False
6 False
7 False
But if I want rows where productCode is 23 and basketID is 1:
(df['productCode']) == 23 & (df['basketID'] == 1)
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
Everything is false.
Why was the first row not matched?
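You need a closing parenthesis after 23, because of operator precedence: & binds more tightly than ==, so 23 & (df['basketID'] == 1) is evaluated first. Wrap each comparison in its own parentheses: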
(df['productCode'] == 23) & (df['basketID'] == 1)
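To see how Python groups the original expression, it can help to write the grouping out explicitly. The following is a minimal sketch that rebuilds the question's DataFrame; the names wrong, explicit and right are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'basketID':    [1, 1, 1, 2, 3, 4, 5, 5],
    'productCode': [23, 24, 25, 23, 23, 25, 24, 25],
})

# & binds more tightly than ==, so Python reads the question's
# expression exactly like the second, explicitly grouped one.
wrong = (df['productCode']) == 23 & (df['basketID'] == 1)
explicit = df['productCode'] == (23 & (df['basketID'] == 1))
print(wrong.equals(explicit))

# Wrapping each comparison in its own parentheses gives the intended mask.
right = (df['productCode'] == 23) & (df['basketID'] == 1)
print(df[right])
```

Only the row with basketID 1 and productCode 23 is selected by the corrected mask.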
I am trying to apply the following df.apply command to a dataframe but want it to skip the first row. Any advice on how to do that without setting the first row as the column headers?
res = sheet1[sheet1.apply(lambda row: row.astype(str).str.contains('TRUE', case=False).any(), axis=1)]
You can select from index 1 onwards as follows:
res = sheet1[1:].apply(lambda row: row.astype(str).str.contains('TRUE', case=False).any(), axis=1)
EDIT Version 3:
import pandas as pd
import random
df = pd.DataFrame({'a':range(1,11), 'b':range(2,21,2), 'c':range(1,20,2),
'd':['TRUE' if random.randint(0,1) else 'FALSE' for _ in range(10)]})
print (df)
res = df[df.apply(lambda row: row.astype(str).str.contains('TRUE', case=False).any(), axis=1)]
print (res.loc[1:])
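If all you want is the matching rows from index 1 onwards, slice the filtered result with .loc[1:], as in the code above. Because column d is filled randomly, your values will differ from this sample run.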
The input Dataframe is:
a b c d
0 1 2 1 TRUE
1 2 4 3 FALSE
2 3 6 5 FALSE
3 4 8 7 TRUE
4 5 10 9 TRUE
5 6 12 11 TRUE
6 7 14 13 FALSE
7 8 16 15 FALSE
8 9 18 17 TRUE
9 10 20 19 TRUE
The output of res will be:
a b c d
0 1 2 1 TRUE
3 4 8 7 TRUE
4 5 10 9 TRUE
5 6 12 11 TRUE
8 9 18 17 TRUE
9 10 20 19 TRUE
The output of res.loc[1:], which excludes the row labelled 0, will be:
a b c d
3 4 8 7 TRUE
4 5 10 9 TRUE
5 6 12 11 TRUE
8 9 18 17 TRUE
9 10 20 19 TRUE
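One caveat: res.loc[1:] slices by index label, whereas res.iloc[1:] slices by position, and the two only agree when row 0 is itself a match. The sketch below uses fixed (non-random) sample values, made up for illustration, in which row 0 does not match.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4],
                   'd': ['FALSE', 'TRUE', 'FALSE', 'TRUE']})

# Same row filter as in the answer: keep rows containing 'TRUE' anywhere.
mask = df.apply(lambda row: row.astype(str).str.contains('TRUE', case=False).any(), axis=1)
res = df[mask]                 # rows labelled 1 and 3

by_label = res.loc[1:]         # drops only the row labelled 0 (if present)
by_position = res.iloc[1:]     # drops the first matching row, whatever its label

print(by_label)
print(by_position)
```

Here by_label keeps both matches, while by_position loses the match at label 1, so .loc is the safer choice when the goal is to ignore the original first row.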
EDIT Version 2:
Here's an example with 'TRUE' and 'FALSE' in the column.
import pandas as pd
import random
df = pd.DataFrame({'a':['TRUE' if random.randint(0,1) else 'FALSE' for _ in range(10)]})
print (df)
res = df.iloc[1:].apply(lambda row: row.astype(str).str.contains('TRUE', case=False).any(), axis=1)
print (res)
The output will be:
Original DataFrame:
a
0 TRUE
1 TRUE
2 FALSE
3 FALSE
4 FALSE
5 TRUE
6 TRUE
7 FALSE
8 FALSE
9 FALSE
Result from the DataFrame:
1 True
2 False
3 False
4 False
5 True
6 True
7 False
8 False
9 False
dtype: bool
You can also give loc instead of iloc:
res = df.loc[1:].apply(lambda row: row.astype(str).str.contains('TRUE', case=False).any(), axis=1)
As you can see, it skipped the first row.
Old answer
Here's an example:
import pandas as pd
df = pd.DataFrame({'a':range(1,11), 'b':range(2,21,2), 'c':range(1,20,2)})
print (df)
res = df.iloc[1:,:].apply(lambda x: x+10,axis=1)
print (res)
Original DataFrame:
a b c
0 1 2 1
1 2 4 3
2 3 6 5
3 4 8 7
4 5 10 9
5 6 12 11
6 7 14 13
7 8 16 15
8 9 18 17
9 10 20 19
The result contains only rows 1 onwards, each value increased by 10:
a b c
1 12 14 13
2 13 16 15
3 14 18 17
4 15 20 19
5 16 22 21
6 17 24 23
7 18 26 25
8 19 28 27
9 20 30 29
I am trying to increment a counter whenever a sequence changes sign.
The counter should increase by 1 each time the values change from the negative to the positive range.
Here is the code:
import pandas as pd
data = {'a': [-1,-1,-2,-3,4,5,6,-2,-2,-3,6,3,6,7,-1,-5,-7,1,34,5]}
df = pd.DataFrame(data)
df['p'] = df.a > 0
df['g'] = (df['p'] != df['p'].shift()).cumsum()
This is the output:
0 -1 False 1
1 -1 False 1
2 -2 False 1
3 -3 False 1
4 4 True 2
5 5 True 2
6 6 True 2
7 -2 False 3
8 -2 False 3
9 -3 False 3
10 6 True 4
11 3 True 4
12 6 True 4
13 7 True 4
14 -1 False 5
I need an output that looks like this:
0 -1 False 1
1 -1 False 1
2 -2 False 1
3 -3 False 1
4 4 True 2
5 5 True 2
6 6 True 2
7 -2 False 2
8 -2 False 2
9 -3 False 2
10 6 True 3
11 3 True 3
12 6 True 3
13 7 True 3
14 -1 False 3
Anybody got an idea?
You can combine the change mask with column p using & (bitwise AND), so that only changes into the positive range are counted:
df['p'] = df.a > 0
df['g'] = ((df['p'] != df['p'].shift()) & df['p']).cumsum() + 1
Another idea is to filter the change mask by column p, take the cumulative sum, forward-fill the missing values, replace the leading NaN values (the first group) with 0 and add 1:
df['p'] = df.a > 0
df['g'] = ((df['p'] != df['p'].shift()))[df['p']].cumsum()
df['g'] = df['g'].ffill().fillna(0).astype(int) + 1
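The intermediate steps of the filtering approach are easier to follow on a short series. This is a minimal sketch with made-up sample values; the explicit reindex stands in for the alignment that pandas performs implicitly when the filtered result is assigned to df['g'].

```python
import pandas as pd

df = pd.DataFrame({'a': [-1, -1, 4, 5, -2, 6]})
p = df['a'] > 0
change = p != p.shift()            # True wherever p differs from the previous row

# Keep only the positive rows, so only switches into the positive range count.
counted = change[p].cumsum()       # rows 2, 3, 5 -> 1, 1, 2

# Realign to all rows: negative rows become NaN, then inherit the last count.
g = counted.reindex(df.index).ffill().fillna(0).astype(int) + 1
df['g'] = g

# The bitwise AND approach, for comparison.
g_and = (change & p).cumsum() + 1
print(df.assign(g_and=g_and))
```

Both columns come out identical, which suggests the two approaches are interchangeable for this kind of data.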
Solution with differences, without the helper p column (astype is used for the cast, because Series.view is deprecated in recent pandas versions):
df['g'] = df.a.gt(0).astype('i1').diff().gt(0).cumsum().add(1)
print (df)
a p g
0 -1 False 1
1 -1 False 1
2 -2 False 1
3 -3 False 1
4 4 True 2
5 5 True 2
6 6 True 2
7 -2 False 2
8 -2 False 2
9 -3 False 2
10 6 True 3
11 3 True 3
12 6 True 3
13 7 True 3
14 -1 False 3
15 -5 False 3
16 -7 False 3
17 1 True 4
18 34 True 4
19 5 True 4
Use np.sign with diff, gt and cumsum:
import numpy as np
s = np.sign(df.a)
df['g'] = s.diff().gt(0).cumsum() + 1
Out[255]:
a p g
0 -1 False 1
1 -1 False 1
2 -2 False 1
3 -3 False 1
4 4 True 2
5 5 True 2
6 6 True 2
7 -2 False 2
8 -2 False 2
9 -3 False 2
10 6 True 3
11 3 True 3
12 6 True 3
13 7 True 3
14 -1 False 3
15 -5 False 3
16 -7 False 3
17 1 True 4
18 34 True 4
19 5 True 4
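The question's data contains no zeros, so np.sign gives the same groups as the a > 0 comparison. If zeros can occur, the two differ, because np.sign maps 0 to 0. A small sketch with made-up values (g_sign and g_bool are illustrative names):

```python
import numpy as np
import pandas as pd

a = pd.Series([-1, 0, 1, -2, 3])

# np.sign maps 0 to 0, so both -1 -> 0 and 0 -> 1 count as upward steps.
g_sign = np.sign(a).diff().gt(0).cumsum() + 1

# With a > 0, zero is treated like a negative value.
p = a > 0
g_bool = (p & ~p.shift(fill_value=False)).cumsum() + 1

print(pd.DataFrame({'a': a, 'g_sign': g_sign, 'g_bool': g_bool}))
```

Which behaviour is wanted depends on whether a zero should count as leaving the negative range.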
You can try this:
df["g"] = (df.p.astype(int).diff() > 0).astype(int).cumsum() + 1
output:
a p g
0 -1 False 1
1 -1 False 1
2 -2 False 1
3 -3 False 1
4 4 True 2
5 5 True 2
6 6 True 2
7 -2 False 2
8 -2 False 2
9 -3 False 2
10 6 True 3
11 3 True 3
12 6 True 3
13 7 True 3
14 -1 False 3
15 -5 False 3
16 -7 False 3
17 1 True 4
18 34 True 4
19 5 True 4
I am new to Python and would appreciate some help. I have a large DataFrame with thousands of rows; an example is shown below.
STATE VOLUME
INDEX
1 on 10
2 on 15
3 on 10
4 off 20
5 off 30
6 on 15
7 on 20
8 off 10
9 off 30
10 off 10
11 on 20
12 off 25
I want to index this data based on the STATE column, such that the first batch of 'on' and 'off' rows gets index 1, the next batch of 'on' and 'off' rows gets index 2, and so on. That way I can select a whole group of data by selecting the rows with index 1.
ID VOLUME
INDEX
1 on 10
1 on 15
1 on 10
1 off 20
1 off 30
2 on 15
2 on 20
2 off 10
2 off 30
2 off 10
3 on 20
3 off 25
You can do this with pd.Series.eq and pd.Series.shift, then take the cumulative sum with pd.Series.cumsum:
df.index = (df['STATE'].eq('off').shift() & df['STATE'].eq('on')).cumsum() + 1
df.index.name = 'INDEX'
STATE VOLUME
INDEX
1 on 10
1 on 15
1 on 10
1 off 20
1 off 30
2 on 15
2 on 20
2 off 10
2 off 30
2 off 10
3 on 20
3 off 25
Details
The idea is to flag each 'on' row whose previous row is 'off'; these rows mark the start of a new batch.
# (df['STATE'].eq('off').shift() & df['STATE'].eq('on')).cumsum() + 1
eq(off).shift eq(on) eq(off).shift & eq(on)
INDEX
1 NaN True False
2 False True False
3 False True False
4 False False False
5 True False False
6 True True True
7 False True False
8 False False False
9 True False False
10 True False False
11 True True True
12 False False False
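The table above can be reproduced with a short runnable sketch. It rebuilds the question's data and, as an assumption of this sketch rather than part of the answer, passes fill_value=False to shift so that the first row holds False instead of NaN and the mask stays boolean.

```python
import pandas as pd

df = pd.DataFrame({
    'STATE':  ['on', 'on', 'on', 'off', 'off', 'on', 'on',
               'off', 'off', 'off', 'on', 'off'],
    'VOLUME': [10, 15, 10, 20, 30, 15, 20, 10, 30, 10, 20, 25],
})

# True on every 'on' row whose previous row was 'off': the start of a new batch.
prev_off = df['STATE'].eq('off').shift(fill_value=False)
new_batch = prev_off & df['STATE'].eq('on')

df.index = new_batch.cumsum() + 1
df.index.name = 'INDEX'

# Selecting one batch is then a plain label lookup.
print(df.loc[1])
```

With the batch number as the index, df.loc[1] returns the five rows of the first on/off batch.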
You could try this with pd.Series.shift and pd.Series.cumsum:
df.index=((df.STATE.shift(-1) != df.STATE)&df.STATE.eq('off')).shift(fill_value=0).cumsum()+1
The same can be done with np.where:
temp=pd.Series(np.where((df.STATE.shift(-1) != df.STATE)&(df.STATE.eq('off')),1,0))
df.index=temp.shift(1,fill_value=0).cumsum().astype(int).add(1)
Output:
df
STATE VOLUME
1 on 10
1 on 15
1 on 10
1 off 20
1 off 30
2 on 15
2 on 20
2 off 10
2 off 30
2 off 10
3 on 20
3 off 25
Explanation:
With (df.STATE.shift(-1) != df.STATE)&df.STATE.eq('off'), you get a mask that is True on the last row of each run of 'off':
(df.STATE.shift(-1) != df.STATE)&df.STATE.eq('off')
1 False
2 False
3 False
4 False
5 True
6 False
7 False
8 False
9 False
10 True
11 False
12 True
Then you shift the mask down by one row, so that each True lands on the first row of the next batch, and take the cumsum(), in which True counts as 1 and False as 0:
((df.STATE.shift(-1) != df.STATE)&df.STATE.eq('off')).shift(fill_value=0)
1 0
2 False
3 False
4 False
5 False
6 True
7 False
8 False
9 False
10 False
11 True
12 False
((df.STATE.shift(-1) != df.STATE)&df.STATE.eq('off')).shift(fill_value=0).cumsum()
1 0
2 0
3 0
4 0
5 0
6 1
7 1
8 1
9 1
10 1
11 2
12 2
Finally, you add 1 so that the index starts at 1, which gives the desired result.
I'm using Python and need a cumulative sum of the Value column that resets on the row after each True in the Bool column, i.e. when Bool changes from True to False. How can I do this?
Bool Value Expected_cumsum
0 False 1 1
1 False 2 3
2 False 4 7
3 True 1 8
4 False 3 3 << reset from here
5 False 5 8
6 True 2 10
....
Thanks all!
You can try this
a = df.Bool.eq(True).cumsum().shift().fillna(0)
df['Expected_cumsum']= df.groupby(a)['Value'].cumsum()
df
Output
Bool Value Expected_cumsum
0 False 1 1
1 False 2 3
2 False 4 7
3 True 1 8
4 False 3 3
5 False 5 8
6 True 2 10
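How the grouping key produces the reset can be seen by printing it next to the data. This is a minimal sketch on the question's rows; it uses shift(fill_value=0) as an assumed shorthand for shift().fillna(0), and the key column is added only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Bool':  [False, False, False, True, False, False, True],
    'Value': [1, 2, 4, 1, 3, 5, 2],
})

# Count the True rows seen so far, then shift down one row so that each
# True row still belongs to the group it closes.
key = df['Bool'].cumsum().shift(fill_value=0)
df['key'] = key

# A cumulative sum within each key restarts after every True row.
df['Expected_cumsum'] = df.groupby(key)['Value'].cumsum()
print(df)
```

Rows 0 to 3 share key 0 and rows 4 to 6 share key 1, so the running total restarts at row 4.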
Imagine we have a dataframe:
num line
0 1 56
1 1 90
2 2 66
3 3 4
4 3 55
5 3 104
6 1 23
7 5 22
8 3 144
I want to remove the rows where specifically a 3 is repeated in the num column, and keep the first. So the two rows with repeating 1's in the num column should still be in the resulting DataFrame together with all the other columns.
What I have so far removes every consecutive duplicate, not only the 3's:
data.groupby((data['num'] != data['num'].shift()).cumsum().values).first()
Expected result:
num line
0 1 56
1 1 90
2 2 66
3 3 4
4 1 23
5 5 22
6 3 144
Use:
df = data[data['num'].ne(3) | data['num'].ne(data['num'].shift())]
print (df)
num line
0 1 56
1 1 90
2 2 66
3 3 4
6 1 23
7 5 22
8 3 144
Detail:
Compare for not equal to 3:
print (data['num'].ne(3))
0 True
1 True
2 True
3 False
4 False
5 False
6 True
7 True
8 False
Name: num, dtype: bool
Compare with the shifted values to flag the first row of each consecutive run:
print (data['num'].ne(data['num'].shift()))
0 True
1 False
2 True
3 True
4 False
5 False
6 True
7 True
8 True
Name: num, dtype: bool
Chain both masks with | (bitwise OR):
print (data['num'].ne(3) | data['num'].ne(data['num'].shift()))
0 True
1 True
2 True
3 True
4 False
5 False
6 True
7 True
8 True
Name: num, dtype: bool
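The filtered frame keeps the original labels (0, 1, 2, 3, 6, 7, 8), while the expected result in the question is numbered 0 to 6. A minimal sketch that rebuilds the question's data and renumbers the rows; the names is_3, same_as_prev and keep are illustrative, and the mask is written in its equivalent "drop" form.

```python
import pandas as pd

data = pd.DataFrame({
    'num':  [1, 1, 2, 3, 3, 3, 1, 5, 3],
    'line': [56, 90, 66, 4, 55, 104, 23, 22, 144],
})

is_3 = data['num'].eq(3)
same_as_prev = data['num'].eq(data['num'].shift())

# Drop a row only if it is a 3 AND repeats the previous row
# (De Morgan equivalent of the ne(...) | ne(...) mask).
keep = ~(is_3 & same_as_prev)

# reset_index(drop=True) renumbers the remaining rows from 0.
result = data[keep].reset_index(drop=True)
print(result)
```

Whether to reset the index is a matter of preference; keeping the original labels can be useful for tracing rows back to the source data.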
You could use the conditions below to perform boolean indexing on the DataFrame:
# True where num is 3
c1 = df['num'].eq(3)
# True where num equals the value in the previous row
c2 = df['num'].eq(df['num'].shift(1))
# boolean indexing on df: keep the first 3 of each run and every non-3 row
df[(c1 & ~c2) | ~(c1)]
num line
0 1 56
1 1 90
2 2 66
3 3 4
6 1 23
7 5 22
8 3 144
Details
df.assign(is_3=c1, is_repeated=c2, filtered=(c1 & ~c2) | ~(c1))
num line is_3 is_repeated filtered
0 1 56 False False True
1 1 90 False True True
2 2 66 False False True
3 3 4 True False True
4 3 55 True True False
5 3 104 True True False
6 1 23 False False True
7 5 22 False False True
8 3 144 True False True