I have a DataFrame with a column called No.. I need to count the number of consecutive 0s in the column No.: the first 0 in a run is recorded as 1, the second as 2, and so on. Whenever a 1 is encountered, the counter resets. The result should be saved in a column called count.
What should I do?
An example of my Dataframe is as follows:
import numpy as np
import pandas as pd
np.random.seed(2021)
a = np.random.randint(0, 2, 20)
df = pd.DataFrame(a, columns=['No.'])
print(df)
No.
0 0
1 1
2 1
3 0
4 1
5 0
6 0
7 0
8 1
9 0
10 1
11 1
12 1
13 1
14 0
15 0
16 0
17 0
18 0
19 0
The result I need:
No. count
0 0 1
1 1 0
2 1 0
3 0 1
4 1 0
5 0 1
6 0 2
7 0 3
8 1 0
9 0 1
10 1 0
11 1 0
12 1 0
13 1 0
14 0 1
15 0 2
16 0 3
17 0 4
18 0 5
19 0 6
Generate pseudo-group labels with cumsum, then build within-group counters with GroupBy.cumsum:
groups = df['No.'].ne(0).cumsum()                        # bump the label at every non-zero
df['count'] = df['No.'].eq(0).groupby(groups).cumsum()   # count zeros within each label
Output:
No. count
0 0 1
1 1 0
2 1 0
3 0 1
4 1 0
5 0 1
6 0 2
7 0 3
8 1 0
9 0 1
10 1 0
11 1 0
12 1 0
13 1 0
14 0 1
15 0 2
16 0 3
17 0 4
18 0 5
19 0 6
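For intuition, here is a quick look at the intermediate groups labels from the answer above (my annotation, using the same seeded df). Every non-zero row starts a new label, so each run of zeros shares the label of the 1 that precedes it, and the per-group cumsum of eq(0) restarts at every 1:
groups = df['No.'].ne(0).cumsum()
print(groups.head(9).tolist())
# [0, 1, 2, 2, 3, 3, 3, 3, 4] -> the zeros at rows 3 and 5-7 fall in the labels opened by the 1s before them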
I have a dataframe like this:
vehicle_id trip
0 0 0
1 0 0
2 0 0
3 0 1
4 0 1
5 1 0
6 1 0
7 1 1
8 1 1
9 1 1
10 1 1
11 1 1
12 1 2
13 2 0
14 2 1
15 2 2
I want to add a column that counts the frequency of each trip value within each vehicle_id group, and then drop the rows where the frequency is equal to one. After adding the column, the frequency will look like this:
vehicle_id trip frequency
0 0 0 3
1 0 0 3
2 0 0 3
3 0 1 2
4 0 1 2
5 1 0 2
6 1 0 2
7 1 1 5
8 1 1 5
9 1 1 5
10 1 1 5
11 1 1 5
12 1 2 1
13 2 0 1
14 2 1 1
15 2 2 1
and the final result will be like this
vehicle_id trip frequency
0 0 0 3
1 0 0 3
2 0 0 3
3 0 1 2
4 0 1 2
5 1 0 2
6 1 0 2
7 1 1 5
8 1 1 5
9 1 1 5
10 1 1 5
11 1 1 5
What is the best solution for that? Also, what should I do if I want to directly drop the rows where the frequency is equal to 1 in each group (without adding a frequency column)?
Check the Colab here:
https://colab.research.google.com/drive/1AuBTuW7vWj1FbJzhPuE-QoLncoF5W_7W?usp=sharing
You can use df.groupby():
df["frequency"] = df.groupby(["vehicle_id","trip"]).transform("count")
But of course you need to create the frequency column beforehand:
df["frequency"] = 0
Taking your dataframe as an example, this gives:
import pandas as pd
dict = {"vehicle_id" : [0,0,0,0,0,1,1,1,1,1,1,1],
"trip" : [0,0,0,1,1,0,0,1,1,1,1,1]}
df = pd.DataFrame.from_dict(dict)
df["frequency"] = 0
df["frequency"] = df.groupby(["vehicle_id","trip"]).transform("count")
Output:
vehicle_id trip frequency
0 0 0 3
1 0 0 3
2 0 0 3
3 0 1 2
4 0 1 2
5 1 0 2
6 1 0 2
7 1 1 5
8 1 1 5
9 1 1 5
10 1 1 5
11 1 1 5
Try:
df["frequency"] = (
df.assign(frequency=0).groupby(["vehicle_id", "trip"]).transform("count")
)
print(df[df.frequency > 1])
Prints:
vehicle_id trip frequency
0 0 0 3
1 0 0 3
2 0 0 3
3 0 1 2
4 0 1 2
5 1 0 2
6 1 0 2
7 1 1 5
8 1 1 5
9 1 1 5
10 1 1 5
11 1 1 5
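To answer the second part of the question (dropping the singleton rows directly, without materializing a frequency column), here is a sketch using GroupBy.filter; this is my addition, not part of either answer above:
# keep only (vehicle_id, trip) groups that occur more than once
out = df.groupby(["vehicle_id", "trip"]).filter(lambda g: len(g) > 1)
# a faster equivalent on large frames, avoiding the Python-level callback:
out = df[df.groupby(["vehicle_id", "trip"])["trip"].transform("size") > 1]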
Say I have a dataframe column of ones and zeros, and I want to group by clusters where the value is 1. Using groupby would ordinarily yield just 2 groups: a single group of zeros and a single group of ones.
df = pd.DataFrame([1,1,1,0,0,0,0,1,1,0,0,0,1,0,1,1,1],columns=['clusters'])
print(df)
clusters
0 1
1 1
2 1
3 0
4 0
5 0
6 0
7 1
8 1
9 0
10 0
11 0
12 1
13 0
14 1
15 1
16 1
for k, g in df.groupby(by=df.clusters):
    print(k, g)
0 clusters
3 0
4 0
5 0
6 0
9 0
10 0
11 0
13 0
1 clusters
0 1
1 1
2 1
7 1
8 1
12 1
14 1
15 1
16 1
So, in effect, I need a new column with a unique identifier for each cluster of 1s, so that we end up with:
clusters unique
0 1 1
1 1 1
2 1 1
3 0 0
4 0 0
5 0 0
6 0 0
7 1 2
8 1 2
9 0 0
10 0 0
11 0 0
12 1 3
13 0 0
14 1 4
15 1 4
16 1 4
Any help welcome. Thanks.
Let us do ngroup:
m = df['clusters'].eq(0)
df['unique'] = df.groupby(m.cumsum()[~m]).ngroup() + 1
clusters unique
0 1 1
1 1 1
2 1 1
3 0 0
4 0 0
5 0 0
6 0 0
7 1 2
8 1 2
9 0 0
10 0 0
11 0 0
12 1 3
13 0 0
14 1 4
15 1 4
16 1 4
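A quick note on why this works (my annotation, not part of the answer): m.cumsum() bumps the label at every zero, so each cluster of ones shares a single label; indexing the key with [~m] drops the zero rows, which therefore fall outside every group, receive ngroup() == -1, and are mapped to 0 by the + 1:
m = df['clusters'].eq(0)
key = m.cumsum()[~m]  # labels exist only for rows where clusters == 1
print(key.tolist())   # [0, 0, 0, 4, 4, 7, 8, 8, 8]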
Using a mask:
m = df['clusters'].eq(0)
df['unique'] = m.ne(m.shift()).mask(m, False).cumsum().mask(m, 0)
output:
clusters unique
0 1 1
1 1 1
2 1 1
3 0 0
4 0 0
5 0 0
6 0 0
7 1 2
8 1 2
9 0 0
10 0 0
11 0 0
12 1 3
13 0 0
14 1 4
15 1 4
16 1 4
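Step by step (my annotation): m.ne(m.shift()) is True at every run boundary; .mask(m, False) keeps only the boundaries that start a run of ones; .cumsum() numbers those runs; and the final .mask(m, 0) zeroes the rows where clusters == 0:
m = df['clusters'].eq(0)
starts = m.ne(m.shift()).mask(m, False)  # True only where a run of ones begins
print(starts.cumsum().mask(m, 0).tolist())
# [1, 1, 1, 0, 0, 0, 0, 2, 2, 0, 0, 0, 3, 0, 4, 4, 4]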
I'm stuck on this issue. Whenever the number 1 appears in the column "values", the next column, "trigger", should display the number 1 in the following 5 cells.
Please consider the following example:
Index values
1 0
2 0
3 1
4 0
5 0
6 0
7 0
8 0
9 0
10 0
11 1
12 0
13 0
14 0
15 0
16 0
17 0
18 0
19 0
20 0
The expected result should be as follows:
Index values trigger
1 0 0
2 0 0
3 1 0
4 0 1
5 0 1
6 0 1
7 0 1
8 0 1
9 0 0
10 0 0
11 1 0
12 0 1
13 0 1
14 0 1
15 0 1
16 0 1
17 0 0
18 0 0
19 0 0
20 0 0
Use Series.ffill:
m = df['values'].eq(1)
df['trigger'] = df['values'].where(m).ffill(limit=5).mask(m).fillna(0, downcast='int')
Or
df['trigger'] = (df['values'].shift().where(lambda x: x.eq(1))
.ffill(limit=4).fillna(0, downcast='int'))
Output
print(df)
Index values trigger
0 1 0 0
1 2 0 0
2 3 1 0
3 4 0 1
4 5 0 1
5 6 0 1
6 7 0 1
7 8 0 1
8 9 0 0
9 10 0 0
10 11 1 0
11 12 0 1
12 13 0 1
13 14 0 1
14 15 0 1
15 16 0 1
16 17 0 0
17 18 0 0
18 19 0 0
19 20 0 0
You could use .fillna(df['values']) instead of .fillna(0) if you want to keep the 1s from the values column.
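As an alternative sketch (my suggestion, assuming triggers never overlap, as in the example): a rolling window over the shifted column marks every row whose 5 preceding values contain a 1.
# trigger = 1 if any of the 5 preceding rows had values == 1
df['trigger'] = (df['values'].shift(fill_value=0)
                             .rolling(5, min_periods=1).max()
                             .astype(int))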
I'm having trouble randomly splitting DataFrame df into groups of smaller DataFrames.
df
movie_id 1 2 4 5 6 7 8 9 10 11 12 borda
0 1 5 4 0 4 4 0 0 0 4 0 0 21
1 2 3 0 0 3 0 0 0 0 0 0 0 6
2 3 4 0 0 0 0 0 0 0 0 0 0 4
3 4 3 0 0 0 0 5 0 0 4 0 5 17
4 5 3 0 0 0 0 0 0 0 0 0 0 3
5 6 5 0 0 0 0 0 0 5 0 0 0 10
6 7 4 0 0 0 2 5 3 4 4 0 0 22
7 8 1 0 0 0 4 5 0 0 0 4 0 14
8 9 5 0 0 0 4 5 0 0 4 5 0 23
9 10 3 2 0 0 0 4 0 0 0 0 0 9
10 11 2 0 4 0 0 3 3 0 4 2 0 18
11 12 5 0 0 0 4 5 0 0 5 2 0 21
12 13 5 4 0 0 2 0 0 0 3 0 0 14
13 14 5 4 0 0 5 0 0 0 0 0 0 14
14 15 5 0 0 0 3 0 0 0 0 5 5 18
15 16 5 0 0 0 0 0 0 0 4 0 0 9
16 17 3 0 0 4 0 0 0 0 0 0 0 7
17 18 4 0 0 0 0 0 0 0 0 0 0 4
18 19 5 3 0 0 4 0 0 0 0 0 0 12
19 20 4 0 0 0 0 0 0 0 0 0 0 4
20 21 1 0 0 3 3 0 0 0 0 0 0 7
21 22 4 0 0 0 3 5 5 0 5 4 0 26
22 23 4 0 0 0 4 3 0 0 5 0 0 16
23 24 3 0 0 4 0 0 0 0 0 3 0 10
I've tried sample and arange, but with bad results.
ran1 = df.sample(frac=0.2, replace=False, random_state=1)
ran2 = df.sample(frac=0.2, replace=False, random_state=1)
ran3 = df.sample(frac=0.2, replace=False, random_state=1)
ran4 = df.sample(frac=0.2, replace=False, random_state=1)
ran5 = df.sample(frac=0.2, replace=False, random_state=1)
print(ran1, '\n')
print(ran2, '\n')
print(ran3, '\n')
print(ran4, '\n')
print(ran5, '\n')
This turned out to be 5 identical DataFrames, because the same random_state=1 produces the same sample every time.
movie_id 1 2 4 5 6 7 8 9 10 11 12 borda
13 14 5 4 0 0 5 0 0 0 0 0 0 14
18 19 5 3 0 0 4 0 0 0 0 0 0 12
3 4 3 0 0 0 0 5 0 0 4 0 5 17
14 15 5 0 0 0 3 0 0 0 0 5 5 18
20 21 1 0 0 3 3 0 0 0 0 0 0 7
I've also tried:
g = df.groupby(['movie_id'])
h = np.arange(g.ngroups)
np.random.shuffle(h)
df[g.ngroup().isin(h[:6])]
The output:
movie_id 1 2 4 5 6 7 8 9 10 11 12 borda
4 5 3 0 0 0 0 0 0 0 0 0 0 3
6 7 4 0 0 0 2 5 3 4 4 0 0 22
7 8 1 0 0 0 4 5 0 0 0 4 0 14
16 17 3 0 0 4 0 0 0 0 0 0 0 7
17 18 4 0 0 0 0 0 0 0 0 0 0 4
18 19 5 3 0 0 4 0 0 0 0 0 0 12
But this still yields only one smaller group; the remaining rows of df aren't grouped.
I expect the smaller groups to be split evenly by percentage, and the whole df should be split into groups.
Use np.array_split:
shuffled = df.sample(frac=1)
result = np.array_split(shuffled, 5)
df.sample(frac=1) shuffles the rows of df; np.array_split then splits the shuffled frame into 5 parts of nearly equal size (here 5, 5, 5, 5, and 4 rows).
It gives you:
for part in result:
print(part,'\n')
movie_id 1 2 4 5 6 7 8 9 10 11 12 borda
5 6 5 0 0 0 0 0 0 5 0 0 0 10
4 5 3 0 0 0 0 0 0 0 0 0 0 3
7 8 1 0 0 0 4 5 0 0 0 4 0 14
16 17 3 0 0 4 0 0 0 0 0 0 0 7
22 23 4 0 0 0 4 3 0 0 5 0 0 16
movie_id 1 2 4 5 6 7 8 9 10 11 12 borda
13 14 5 4 0 0 5 0 0 0 0 0 0 14
14 15 5 0 0 0 3 0 0 0 0 5 5 18
21 22 4 0 0 0 3 5 5 0 5 4 0 26
1 2 3 0 0 3 0 0 0 0 0 0 0 6
20 21 1 0 0 3 3 0 0 0 0 0 0 7
movie_id 1 2 4 5 6 7 8 9 10 11 12 borda
10 11 2 0 4 0 0 3 3 0 4 2 0 18
9 10 3 2 0 0 0 4 0 0 0 0 0 9
11 12 5 0 0 0 4 5 0 0 5 2 0 21
8 9 5 0 0 0 4 5 0 0 4 5 0 23
12 13 5 4 0 0 2 0 0 0 3 0 0 14
movie_id 1 2 4 5 6 7 8 9 10 11 12 borda
18 19 5 3 0 0 4 0 0 0 0 0 0 12
3 4 3 0 0 0 0 5 0 0 4 0 5 17
0 1 5 4 0 4 4 0 0 0 4 0 0 21
23 24 3 0 0 4 0 0 0 0 0 3 0 10
6 7 4 0 0 0 2 5 3 4 4 0 0 22
movie_id 1 2 4 5 6 7 8 9 10 11 12 borda
17 18 4 0 0 0 0 0 0 0 0 0 0 4
2 3 4 0 0 0 0 0 0 0 0 0 0 4
15 16 5 0 0 0 0 0 0 0 4 0 0 9
19 20 4 0 0 0 0 0 0 0 0 0 0 4
A simple demo:
df = pd.DataFrame({"movie_id": np.arange(1, 25),
"borda": np.random.randint(1, 25, size=(24,))})
n_split = 5
# the indices used to select parts from dataframe
ixs = np.arange(df.shape[0])
np.random.shuffle(ixs)
# np.split cannot work when there is no equal division
# so we need to find out the split points ourself
# we need (n_split-1) split points
split_points = [i*df.shape[0]//n_split for i in range(1, n_split)]
# use these indices to select the part we want
for ix in np.split(ixs, split_points):
print(df.iloc[ix])
The result:
borda movie_id
8 3 9
10 2 11
22 14 23
7 14 8
borda movie_id
0 16 1
20 4 21
17 15 18
15 1 16
6 6 7
borda movie_id
9 9 10
19 4 20
5 1 6
16 23 17
21 20 22
borda movie_id
11 24 12
23 5 24
1 22 2
12 7 13
18 15 19
borda movie_id
3 11 4
14 10 15
2 6 3
4 7 5
13 21 14
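Worth noting (my addition): np.array_split computes these split points automatically, so the manual loop above is roughly equivalent to the following, reusing ixs and n_split from the demo (array_split puts the larger chunks first, so the part sizes may differ slightly):
for part in np.array_split(df.iloc[ixs], n_split):
    print(part)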
IIUC, you can do this:
frames = {}
for e, i in enumerate(np.split(df, 6)):
    frames['df_' + str(e + 1)] = pd.DataFrame(np.random.permutation(i), columns=df.columns)
print(frames['df_1'])
movie_id 1 2 4 5 6 7 8 9 10 11 12 borda
0 4 3 0 0 0 0 5 0 0 4 0 5 17
1 3 4 0 0 0 0 0 0 0 0 0 0 4
2 2 3 0 0 3 0 0 0 0 0 0 0 6
3 1 5 4 0 4 4 0 0 0 4 0 0 21
Explanation: np.split(df, 6) splits df into 6 equal-sized parts; it raises an error if the length is not evenly divisible.
pd.DataFrame(np.random.permutation(i), columns=df.columns) randomly shuffles the rows of each part and wraps the result in a new DataFrame, which is stored in a dictionary named frames.
Finally, print the dictionary by key: frames['df_1'], frames['df_2'], and so on each return a random permutation of one split of the dataframe.
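If len(df) is not divisible by 6, np.split raises a ValueError; here is a sketch of the same pattern with np.array_split (my suggestion), which tolerates unequal parts:
frames = {'df_' + str(e + 1): pd.DataFrame(np.random.permutation(part), columns=df.columns)
          for e, part in enumerate(np.array_split(df, 6))}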
I would like to slice a dataframe to return rows where element x=0 appears consecutively at least n=3 times, dropping the first i=2 instances in each mini-sequence.
Is there an efficient way of achieving this in pandas, and if not, using numpy or scipy?
import pandas as pd
import numpy as np
Example 1
df=pd.DataFrame({'A':[0,1,0,0,1,1,0,0,0,0,1,1,0,0,0,1,1],'B':np.random.randn(17)})
A B
0 0 0.748958
1 1 0.254730
2 0 0.629609
3 0 0.272738
4 1 -1.885906
5 1 1.206371
6 0 -0.332471
7 0 0.217553
8 0 0.768986
9 0 -1.607236
10 1 1.613650
11 1 -1.096892
12 0 -0.435762
13 0 0.131284
14 0 -0.177188
15 1 1.393890
16 1 0.174803
Desired output:
A B
8 0 0.768986
9 0 -1.607236
14 0 -0.177188
Example 2
x=0 (element of interest)
n=5 (min length of sequence)
i=2 (drop first two in each sequence)
df2=pd.DataFrame({'A':[0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0],'B':np.random.randn(20)})
A B
0 0 0.703803
1 0 -0.144088
2 0 0.635577
3 0 -0.834611
4 0 1.472271
5 0 -0.554860
6 0 -0.167016
7 1 0.578847
8 1 -1.873663
9 0 0.197062
10 0 1.458845
11 0 -1.921660
12 0 -1.301481
13 0 0.240197
14 0 -1.425058
15 1 -2.801151
16 0 0.766757
17 0 1.249806
18 0 0.595366
19 0 -1.447632
Desired output:
A B
2 0 0.635577
3 0 -0.834611
4 0 1.472271
5 0 -0.554860
6 0 -0.167016
11 0 -1.921660
12 0 -1.301481
13 0 0.240197
14 0 -1.425058
Here's an approach using some NumPy manipulations -
def slice_consc(df, n):
    Acol = np.array(df['A'])
    # pad the zero-mask with 0s so runs touching either edge are detected
    Acol_ext = np.concatenate(([0], (Acol == 0) + 0, [0]))
    starts = np.where(np.diff(Acol_ext) == 1)[0]   # first index of each run of zeros
    stops = np.where(np.diff(Acol_ext) == -1)[0]   # one past the last index of each run
    ids = np.zeros(Acol.size + 2, dtype=int)
    valid_mask = stops - starts >= n               # keep only runs of length >= n
    ids[stops[valid_mask]] = -1
    ids[starts[valid_mask] + 2] = 1                # the +2 drops the first i=2 rows of each run
    return df[(ids.cumsum() == 1)[:-2]]
Sample runs -
Case #1:
>>> df
A B
0 0 0.977325
1 1 -0.408457
2 0 -0.377973
3 0 0.567537
4 1 -0.222019
5 1 -1.167422
6 0 -0.142546
7 0 0.675458
8 0 -0.184456
9 0 -0.826050
10 1 -0.772413
11 1 -1.556440
12 0 -0.687249
13 0 -0.481676
14 0 0.420400
15 1 0.031999
16 1 -1.092540
>>> slice_consc(df,3)
A B
8 0 -0.184456
9 0 -0.826050
14 0 0.420400
Case #2:
>>> df2
A B
0 0 0.757102
1 0 2.114935
2 0 -0.352309
3 0 -0.214931
4 0 -1.626064
5 0 -0.989776
6 0 0.639635
7 1 0.049358
8 1 -2.600326
9 0 0.057792
10 0 1.263418
11 0 0.618495
12 0 -1.637054
13 0 1.220862
14 0 1.245484
15 1 1.388218
16 0 -0.499900
17 0 0.761310
18 0 -1.308176
19 0 -2.005983
>>> slice_consc(df2,5)
A B
2 0 -0.352309
3 0 -0.214931
4 0 -1.626064
5 0 -0.989776
6 0 0.639635
11 0 0.618495
12 0 -1.637054
13 0 1.220862
14 0 1.245484
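To address the pandas part of the question, here is a pandas-only sketch of the same idea (my alternative, not the answer above): label consecutive runs, then filter by run length and position within the run.
def slice_consc_pd(df, x=0, n=3, i=2):
    runs = df['A'].ne(df['A'].shift()).cumsum()        # label each consecutive run
    is_x = df['A'].eq(x)
    run_len = df.groupby(runs)['A'].transform('size')  # length of the run each row is in
    pos = df.groupby(runs).cumcount()                  # row's position within its run
    return df[is_x & (run_len >= n) & (pos >= i)]
# slice_consc_pd(df, n=3, i=2) should match slice_consc(df, 3) above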