I have the following code:
data={'id':[1,2,3,4,5,6,7,8,9,10,11],
'value':[1,0,1,0,1,1,1,0,0,1,0]}
df=pd.DataFrame.from_dict(data)
df
Out[8]:
id value
0 1 1
1 2 0
2 3 1
3 4 0
4 5 1
5 6 1
6 7 1
7 8 0
8 9 0
9 10 1
10 11 0
I want to create a flag column that marks consecutive repeated values with 1, starting from the second occurrence and ignoring the first.
With my current solution:
df['flag'] =
df.value.groupby([df.value,df.flag.diff().ne(0).cumsum()]).transform('size').ge(3).astype(int)
Out[8]:
id value flag
0 1 1 0
1 2 0 0
2 3 1 0
3 4 0 0
4 5 1 1
5 6 1 1
6 7 1 1
7 8 0 1
8 9 0 1
9 10 1 0
10 11 0 0
What I need instead is a result like this, where the first occurrence is flagged as 0 and the flag becomes 1 starting from the second:
Out[8]:
id value flag
0 1 1 0
1 2 0 0
2 3 1 0
3 4 0 0
4 5 1 0
5 6 1 1
6 7 1 1
7 8 0 0
8 9 0 1
9 10 1 0
10 11 0 0
Create groups of consecutive equal values by comparing the Series with its shifted values using Series.ne and Series.cumsum, build a counter within each group with GroupBy.cumcount, test for values greater than 0 with Series.gt, and finally map True/False to 1/0 by casting to integers with Series.astype:
df['flag'] = (df.groupby(df['value'].ne(df['value'].shift()).cumsum())
.cumcount()
.gt(0)
.astype(int))
print (df)
id value flag
0 1 1 0
1 2 0 0
2 3 1 0
3 4 0 0
4 5 1 0
5 6 1 1
6 7 1 1
7 8 0 0
8 9 0 1
9 10 1 0
10 11 0 0
How it works:
print (df.assign(g = df['value'].ne(df['value'].shift()).cumsum(),
counter = df.groupby(df['value'].ne(df['value'].shift()).cumsum()).cumcount(),
mask = df.groupby(df['value'].ne(df['value'].shift()).cumsum()).cumcount().gt(0)))
id value g counter mask
0 1 1 1 0 False
1 2 0 2 0 False
2 3 1 3 0 False
3 4 0 4 0 False
4 5 1 5 0 False
5 6 1 5 1 True
6 7 1 5 2 True
7 8 0 6 0 False
8 9 0 6 1 True
9 10 1 7 0 False
10 11 0 8 0 False
Use groupby.cumcount and a custom grouper:
# group by identical successive values
grp = df['value'].ne(df['value'].shift()).cumsum()
# flag all but the first one (>0)
# convert the booleans True/False to integers 1/0
df['flag'] = df.groupby(grp).cumcount().gt(0).astype(int)
Generic code to skip the first N values of each run:
N = 1
grp = df['value'].ne(df['value'].shift()).cumsum()
df['flag'] = df.groupby(grp).cumcount().ge(N).astype(int)
Output:
id value flag
0 1 1 0
1 2 0 0
2 3 1 0
3 4 0 0
4 5 1 0
5 6 1 1
6 7 1 1
7 8 0 0
8 9 0 1
9 10 1 0
10 11 0 0
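The two answers above share the same idea, which can be sketched as one reusable function. The name flag_after_first_n and the extra flag_skip2 column are hypothetical, added only for illustration; the data is the frame from the question.

```python
import pandas as pd


def flag_after_first_n(values: pd.Series, n: int = 1) -> pd.Series:
    """Return 1 for every element past the first n of each run of equal values."""
    # A new run starts wherever a value differs from the previous one.
    run_id = values.ne(values.shift()).cumsum()
    # Position of each element inside its run: 0, 1, 2, ...
    position = values.groupby(run_id).cumcount()
    return position.ge(n).astype(int)


df = pd.DataFrame({"id": range(1, 12),
                   "value": [1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0]})
df["flag"] = flag_after_first_n(df["value"], n=1)
# Hypothetical variant: skip the first two values of each run.
df["flag_skip2"] = flag_after_first_n(df["value"], n=2)
print(df)
```

With n=1 this should reproduce the requested flag column; with n=2 only the third element of the run 1, 1, 1 is flagged.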
Related
I have a dataframe like this:
vehicle_id trip
0 0 0
1 0 0
2 0 0
3 0 1
4 0 1
5 1 0
6 1 0
7 1 1
8 1 1
9 1 1
10 1 1
11 1 1
12 1 2
13 2 0
14 2 1
15 2 2
I want to add a column that counts the frequency of each trip value for each 'vehicle id' group and drop the rows where the frequency is equal to 'one'. So after adding the column the frequency will be like this:
vehicle_id trip frequency
0 0 0 3
1 0 0 3
2 0 0 3
3 0 1 2
4 0 1 2
5 1 0 2
6 1 0 2
7 1 1 5
8 1 1 5
9 1 1 5
10 1 1 5
11 1 1 5
12 1 2 1
13 2 0 1
14 2 1 1
15 2 2 1
and the final result will be like this:
vehicle_id trip frequency
0 0 0 3
1 0 0 3
2 0 0 3
3 0 1 2
4 0 1 2
5 1 0 2
6 1 0 2
7 1 1 5
8 1 1 5
9 1 1 5
10 1 1 5
11 1 1 5
What is the best solution for that? Also, what should I do if I intend to directly drop rows where the frequency is equal to 1 in each group (without adding the frequency column)?
Check the Colab notebook here:
https://colab.research.google.com/drive/1AuBTuW7vWj1FbJzhPuE-QoLncoF5W_7W?usp=sharing
You can use df.groupby():
df["frequency"] = df.groupby(["vehicle_id","trip"]).transform("count")
But you need to create the frequency column beforehand, because transform("count") only counts the columns that are not used as grouping keys:
df["frequency"] = 0
Taking your dataframe as an example, this gives:
import pandas as pd
# named "data" rather than "dict" to avoid shadowing the built-in type
data = {"vehicle_id" : [0,0,0,0,0,1,1,1,1,1,1,1],
"trip" : [0,0,0,1,1,0,0,1,1,1,1,1]}
df = pd.DataFrame.from_dict(data)
df["frequency"] = 0
df["frequency"] = df.groupby(["vehicle_id","trip"]).transform("count")
output :
Try:
df["frequency"] = (
df.assign(frequency=0).groupby(["vehicle_id", "trip"]).transform("count")
)
print(df[df.frequency > 1])
Prints:
vehicle_id trip frequency
0 0 0 3
1 0 0 3
2 0 0 3
3 0 1 2
4 0 1 2
5 1 0 2
6 1 0 2
7 1 1 5
8 1 1 5
9 1 1 5
10 1 1 5
11 1 1 5
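The second part of the question (dropping the rows directly, without adding a frequency column) is not covered above. A minimal sketch, assuming the full 16-row frame from the question; the variable names group_size, filtered and filtered_alt are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "vehicle_id": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2],
    "trip":       [0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 2, 0, 1, 2],
})

# Size of each (vehicle_id, trip) group, broadcast back to the rows.
group_size = df.groupby(["vehicle_id", "trip"])["trip"].transform("size")

# Keep only rows whose pair occurs more than once; no column is added to df.
filtered = df[group_size > 1]

# Equivalent here: a pair that occurs more than once is a duplicated pair.
filtered_alt = df[df.duplicated(["vehicle_id", "trip"], keep=False)]
print(filtered)
```

Both variants leave df itself unchanged and should keep rows 0 to 11 of the example.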
Say I had a dataframe column of ones and zeros, and I wanted to group by clusters of where the value is 1. Using groupby would ordinarily render 2 groups, a single group of zeros, and a single group of ones.
df = pd.DataFrame([1,1,1,0,0,0,0,1,1,0,0,0,1,0,1,1,1],columns=['clusters'])
print(df)
clusters
0 1
1 1
2 1
3 0
4 0
5 0
6 0
7 1
8 1
9 0
10 0
11 0
12 1
13 0
14 1
15 1
16 1
for k, g in df.groupby(by=df.clusters):
    print(k, g)
0 clusters
3 0
4 0
5 0
6 0
9 0
10 0
11 0
13 0
1 clusters
0 1
1 1
2 1
7 1
8 1
12 1
14 1
15 1
16 1
So in effect, I need to have a new column with a unique identifier for all clusters of 1: hence we would end up with:
clusters unique
0 1 1
1 1 1
2 1 1
3 0 0
4 0 0
5 0 0
6 0 0
7 1 2
8 1 2
9 0 0
10 0 0
11 0 0
12 1 3
13 0 0
14 1 4
15 1 4
16 1 4
Any help welcome. Thanks.
Let us use ngroup:
m = df['clusters'].eq(0)
df['unique'] = df.groupby(m.cumsum()[~m]).ngroup() + 1
clusters unique
0 1 1
1 1 1
2 1 1
3 0 0
4 0 0
5 0 0
6 0 0
7 1 2
8 1 2
9 0 0
10 0 0
11 0 0
12 1 3
13 0 0
14 1 4
15 1 4
16 1 4
Using a mask:
m = df['clusters'].eq(0)
df['unique'] = m.ne(m.shift()).mask(m, False).cumsum().mask(m, 0)
output:
clusters unique
0 1 1
1 1 1
2 1 1
3 0 0
4 0 0
5 0 0
6 0 0
7 1 2
8 1 2
9 0 0
10 0 0
11 0 0
12 1 3
13 0 0
14 1 4
15 1 4
16 1 4
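The same labelling can also be written by marking where each cluster starts and taking a running count of those starts. This is a sketch of an alternative formulation, not a replacement for the answers above; the names is_one and starts are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"clusters": [1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1]}
)

is_one = df["clusters"].eq(1)
# A cluster starts where a 1 is not preceded by another 1.
starts = is_one & ~is_one.shift(fill_value=False)
# The running count of starts numbers the clusters; rows holding 0 are reset to 0.
df["unique"] = starts.cumsum().where(is_one, 0)
print(df)
```

Because every step is a plain boolean or cumulative operation, the intermediate Series can be printed one by one to check how the labels are built.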
Below is the code and output, what I'm trying to get is shown in the "exp" column, as you can see the "countif" column just counts 5 columns, but I want it to only count negative values.
So for example: index 0, df1[0] should equal 2
What am I doing wrong?
Python
import pandas as pd
import numpy as np
a = ['A','B','C','B','C','A','A','B','C','C','A','C','B','A']
b = [2,4,1,1,2,5,-1,2,2,3,4,3,3,3]
c = [-2,4,1,-1,2,5,1,2,2,3,4,3,3,3]
d = [-2,-4,1,-1,2,5,1,2,2,3,4,3,3,3]
exp = [2,1,0,2,0,0,1,0,0,0,0,0,0,0]
df1 = pd.DataFrame({'b':b,'c':c,'d':d,'exp':exp}, columns=['b','c','d','exp'])
df1['sumif'] = df1.where(df1<0,0).sum(1)
df1['countif'] = df1.where(df1<0,0).count(1)
df1
# df1.sort_values(['a','countif'], ascending=[True, True])
Output
You don't need where here, you can simply use df.lt with df.sum(axis=1):
In [1329]: df1['exp'] = df1.lt(0).sum(1)
In [1330]: df1
Out[1330]:
b c d exp
0 2 -2 -2 2
1 4 4 -4 1
2 1 1 1 0
3 1 -1 -1 2
4 2 2 2 0
5 5 5 5 0
6 -1 1 1 1
7 2 2 2 0
8 2 2 2 0
9 3 3 3 0
10 4 4 4 0
11 3 3 3 0
12 3 3 3 0
13 3 3 3 0
EDIT: As per OP's comment including solution with iloc and .lt:
In [1609]: df1['exp'] = df1.iloc[:, :3].lt(0).sum(1)
DataFrame.where works differently: it keeps values where the condition is True and replaces the others (here, values greater than or equal to 0) with 0. The replaced cells are still non-missing, so count includes them and cannot be used this way:
print (df1.iloc[:, :3].where(df1<0,0))
b c d
0 0 -2 -2
1 0 0 -4
2 0 0 0
3 0 -1 -1
4 0 0 0
5 0 0 0
6 -1 0 0
7 0 0 0
8 0 0 0
9 0 0 0
10 0 0 0
11 0 0 0
12 0 0 0
13 0 0 0
You need to compare the first 3 columns against 0 with "less than" and sum the resulting booleans:
df1['exp1'] = (df1.iloc[:, :3] < 0).sum(1)
# to compare all columns instead:
#df1['exp1'] = (df1 < 0).sum(1)
print (df1)
b c d exp exp1
0 2 -2 -2 2 2
1 4 4 -4 1 1
2 1 1 1 0 0
3 1 -1 -1 2 2
4 2 2 2 0 0
5 5 5 5 0 0
6 -1 1 1 1 1
7 2 2 2 0 0
8 2 2 2 0 0
9 3 3 3 0 0
10 4 4 4 0 0
11 3 3 3 0 0
12 3 3 3 0 0
13 3 3 3 0 0
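To put the "countif" and "sumif" variants side by side, here is a small sketch on the first seven rows of the question's data. The column name countif_nan is made up for illustration; it shows that count does work once the non-matching cells are replaced by NaN instead of 0.

```python
import pandas as pd

df1 = pd.DataFrame({
    "b": [2, 4, 1, 1, 2, 5, -1],
    "c": [-2, 4, 1, -1, 2, 5, 1],
    "d": [-2, -4, 1, -1, 2, 5, 1],
})
cols = ["b", "c", "d"]
negative = df1[cols].lt(0)

# "countif": True counts as 1, so a row-wise sum counts the negatives.
df1["countif"] = negative.sum(axis=1)
# "sumif": keep the negative values, replace the rest with 0, then sum.
df1["sumif"] = df1[cols].where(negative, 0).sum(axis=1)
# count() only skips missing values, so replacing with NaN (the default) works too.
df1["countif_nan"] = df1[cols].where(negative).count(axis=1)
print(df1)
```

Selecting the value columns explicitly (cols) keeps the helper columns from being included in later comparisons.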
I have a pandas dataframe as follows:
df2
amount 1 2 3 4
0 5 1 1 1 1
1 7 0 1 1 1
2 9 0 0 0 1
3 8 0 0 1 0
4 2 0 0 0 1
What I want to do is replace the 1s on every row with the value of the amount field in that row and leave the zeros as is. The output should look like this
amount 1 2 3 4
0 5 5 5 5 5
1 7 0 7 7 7
2 9 0 0 0 9
3 8 0 0 8 0
4 2 0 0 0 2
I've tried applying a lambda function row-wise like this, but I'm running into errors
df2.apply(lambda x: x.loc[i].replace(0, x['amount']) for i in len(x), axis=1)
Any help would be much appreciated. Thanks
Let's use mask:
df2.mask(df2 == 1, df2['amount'], axis=0)
Output:
amount 1 2 3 4
0 5 5 5 5 5
1 7 0 7 7 7
2 9 0 0 0 9
3 8 0 0 8 0
4 2 0 0 0 2
You can also do it with the pandas.DataFrame.mul() method, like this:
>>> df2.iloc[:, 1:] = df2.iloc[:, 1:].mul(df2['amount'], axis=0)
>>> print(df2)
amount 1 2 3 4
0 5 5 5 5 5
1 7 0 7 7 7
2 9 0 0 0 9
3 8 0 0 8 0
4 2 0 0 0 2
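For reference, a self-contained version of the mask approach that leaves the amount column out of the comparison. It assumes the flag columns are named with the strings "1" to "4" (the question does not say whether they are strings or integers) and writes the result to a copy named result, which is an illustrative choice.

```python
import pandas as pd

df2 = pd.DataFrame({
    "amount": [5, 7, 9, 8, 2],
    "1": [1, 0, 0, 0, 0],
    "2": [1, 1, 0, 0, 0],
    "3": [1, 1, 0, 1, 0],
    "4": [1, 1, 1, 0, 1],
})

# Every column except "amount" is treated as a 0/1 flag column.
flag_cols = [c for c in df2.columns if c != "amount"]

result = df2.copy()
# Where a flag equals 1, take that row's amount; axis=0 aligns the
# amount Series on the row index rather than on the column labels.
result[flag_cols] = df2[flag_cols].mask(df2[flag_cols].eq(1), df2["amount"], axis=0)
print(result)
```

Restricting the operation to flag_cols matters more for the mul variant, where multiplying the amount column by itself would square it.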
I have
{"A":[0,1], "B":[4,5], "C":[0,1], "D":[0,1]}
What I want is:
A B C D
0 4 0 0
0 4 0 1
0 4 1 0
0 4 1 1
1 4 0 1
...and so on. Basically all the combinations of values for each of the categories.
What would be the best way to achieve this?
If x is your dict:
>>> pandas.DataFrame(list(itertools.product(*x.values())), columns=x.keys())
A C B D
0 0 0 4 0
1 0 0 4 1
2 0 0 5 0
3 0 0 5 1
4 0 1 4 0
5 0 1 4 1
6 0 1 5 0
7 0 1 5 1
8 1 0 4 0
9 1 0 4 1
10 1 0 5 0
11 1 0 5 1
12 1 1 4 0
13 1 1 4 1
14 1 1 5 0
15 1 1 5 1
If you want the columns in a particular order, you'll need to reorder them afterwards (with, e.g., df[["A", "B", "C", "D"]]).
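A runnable version of the one-liner with its imports, as a sketch. The name combos is illustrative. On Python 3.7 and later, dicts preserve insertion order, so the columns should come out as A, B, C, D without reordering.

```python
import itertools

import pandas as pd

x = {"A": [0, 1], "B": [4, 5], "C": [0, 1], "D": [0, 1]}

# One row per element of the Cartesian product of the value lists.
combos = pd.DataFrame(list(itertools.product(*x.values())), columns=list(x))
print(combos)
```

The number of rows is the product of the list lengths, here 2 * 2 * 2 * 2 = 16.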