Give unique identifiers to clusters containing the same value - python

Say I had a dataframe column of ones and zeros, and I wanted to group by clusters of where the value is 1. Using groupby would ordinarily render 2 groups, a single group of zeros, and a single group of ones.
df = pd.DataFrame([1,1,1,0,0,0,0,1,1,0,0,0,1,0,1,1,1],columns=['clusters'])
print df
clusters
0 1
1 1
2 1
3 0
4 0
5 0
6 0
7 1
8 1
9 0
10 0
11 0
12 1
13 0
14 1
15 1
16 1
for k, g in df.groupby(by=df.clusters):
print k, g
0 clusters
3 0
4 0
5 0
6 0
9 0
10 0
11 0
13 0
1 clusters
0 1
1 1
2 1
7 1
8 1
12 1
14 1
15 1
16 1
So in effect, I need to have a new column with a unique identifier for all clusters of 1: hence we would end up with:
clusters unique
0 1 1
1 1 1
2 1 1
3 0 0
4 0 0
5 0 0
6 0 0
7 1 2
8 1 2
9 0 0
10 0 0
11 0 0
12 1 3
13 0 0
14 1 4
15 1 4
16 1 4
Any help welcome. Thanks.

Let us do ngroup
m = df['clusters'].eq(0)
df['unqiue'] = df.groupby(m.cumsum()[~m]).ngroup() + 1
clusters unqiue
0 1 1
1 1 1
2 1 1
3 0 0
4 0 0
5 0 0
6 0 0
7 1 2
8 1 2
9 0 0
10 0 0
11 0 0
12 1 3
13 0 0
14 1 4
15 1 4
16 1 4

Using a mask:
m = df['clusters'].eq(0)
df['unique'] = m.ne(m.shift()).mask(m, False).cumsum().mask(m, 0)
output:
clusters unique
0 1 1
1 1 1
2 1 1
3 0 0
4 0 0
5 0 0
6 0 0
7 1 2
8 1 2
9 0 0
10 0 0
11 0 0
12 1 3
13 0 0
14 1 4
15 1 4
16 1 4

Related

Add a column based on frequency for each group

I have a dataframe like this:
vehicle_id trip
0 0 0
1 0 0
2 0 0
3 0 1
4 0 1
5 1 0
6 1 0
7 1 1
8 1 1
9 1 1
10 1 1
11 1 1
12 1 2
13 2 0
14 2 1
15 2 2
I want to add a column that counts the frequency of each trip value for each 'vehicle id' group and drop the rows where the frequency is equal to 'one'. So after adding the column the frequency will be like this:
vehicle_id trip frequency
0 0 0 3
1 0 0 3
2 0 0 3
3 0 1 2
4 0 1 2
5 1 0 2
6 1 0 2
7 1 1 5
8 1 1 5
9 1 1 5
10 1 1 5
11 1 1 5
12 1 2 1
13 2 0 1
14 2 1 1
15 2 2 1
and the final result will be like this
vehicle_id trip frequency
0 0 0 3
1 0 0 3
2 0 0 3
3 0 1 2
4 0 1 2
5 1 0 2
6 1 0 2
7 1 1 5
8 1 1 5
9 1 1 5
10 1 1 5
11 1 1 5
what is the best solution for that? Also, what should I do if I intend to directly drop rows where the frequency is equal to 1 in each group (without adding the frequency column)?
Check the collab here :
https://colab.research.google.com/drive/1AuBTuW7vWj1FbJzhPuE-QoLncoF5W_7W?usp=sharing
You can use df.groupby() :
df["frequency"] = df.groupby(["vehicle_id","trip"]).transform("count")
But of course you need to create the frequency column before_hand :
df["frequency"] = 0
If I take your dataframe as example this gives :
import pandas as pd
dict = {"vehicle_id" : [0,0,0,0,0,1,1,1,1,1,1,1],
"trip" : [0,0,0,1,1,0,0,1,1,1,1,1]}
df = pd.DataFrame.from_dict(dict)
df["frequency"] = 0
df["frequency"] = df.groupby(["vehicle_id","trip"]).transform("count")
output :
Try:
df["frequency"] = (
df.assign(frequency=0).groupby(["vehicle_id", "trip"]).transform("count")
)
print(df[df.frequency > 1])
Prints:
vehicle_id trip frequency
0 0 0 3
1 0 0 3
2 0 0 3
3 0 1 2
4 0 1 2
5 1 0 2
6 1 0 2
7 1 1 5
8 1 1 5
9 1 1 5
10 1 1 5
11 1 1 5

How to count the current consecutive 0 in the Dataframe column?

I have a Dataframe with a column called No.. I need to count the number of consecutive 0s in column No.. For example, the first 0 is recorded as 1, and the second 0 is recorded as 2. If it encounters 1, the counter is cleared. And save the result in the column count.
what should I do?
An example of my Dataframe is as follows:
import numpy as np
import pandas as pd
np.random.seed(2021)
a = np.random.randint(0, 2, 20)
df = pd.DataFrame(a, columns=['No.'])
print(df)
No.
0 0
1 1
2 1
3 0
4 1
5 0
6 0
7 0
8 1
9 0
10 1
11 1
12 1
13 1
14 0
15 0
16 0
17 0
18 0
19 0
The result I need:
No. count
0 0 1
1 1 0
2 1 0
3 0 1
4 1 0
5 0 1
6 0 2
7 0 3
8 1 0
9 0 1
10 1 0
11 1 0
12 1 0
13 1 0
14 0 1
15 0 2
16 0 3
17 0 4
18 0 5
19 0 6
Generate pseudo-groups with cumsum and then generate within-group counters with groupby.cumsum:
groups = df['No.'].ne(0).cumsum()
df['count'] = df['No.'].eq(0).groupby(groups).cumsum()
Output:
No. count
0 0 1
1 1 0
2 1 0
3 0 1
4 1 0
5 0 1
6 0 2
7 0 3
8 1 0
9 0 1
10 1 0
11 1 0
12 1 0
13 1 0
14 0 1
15 0 2
16 0 3
17 0 4
18 0 5
19 0 6

If column has certain value, the next column displays value for a number of cells

I'm stuck in this issue. Considering that in column “value” the number 1 appears, then the next column "trigger” displays the number 1 in the next 5 cells.
Please consider the following example:
Index values
1 0
2 0
3 1
4 0
5 0
6 0
7 0
8 0
9 0
10 0
11 1
12 0
13 0
14 0
15 0
16 0
17 0
18 0
19 0
20 0
The expected result should be as follows:
Index values trigger
1 0 0
2 0 0
3 1 0
4 0 1
5 0 1
6 0 1
7 0 1
8 0 1
9 0 0
10 0 0
11 1 0
12 0 1
13 0 1
14 0 1
15 0 1
16 0 1
17 0 0
18 0 0
19 0 0
20 0 0
Series.ffill
m = df['values'].eq(1)
df['trigger'] = df['values'].where(m).ffill(limit=5).mask(m).fillna(0,
downcast='int')
Or
df['trigger'] = (df['values'].shift().where(lambda x: x.eq(1))
.ffill(limit=4).fillna(0, downcast='int'))
Output
print(df)
Index values trigger
0 1 0 0
1 2 0 0
2 3 1 0
3 4 0 1
4 5 0 1
5 6 0 1
6 7 0 1
7 8 0 1
8 9 0 0
9 10 0 0
10 11 1 0
11 12 0 1
12 13 0 1
13 14 0 1
14 15 0 1
15 16 0 1
16 17 0 0
17 18 0 0
18 19 0 0
19 20 0 0
You could use .fillna(df['value']) if you want keep values of column values

How to randomly split a DataFrame into several smaller DataFrames?

I'm having trouble randomly splitting DataFrame df into groups of smaller DataFrames.
df
movie_id 1 2 4 5 6 7 8 9 10 11 12 borda
0 1 5 4 0 4 4 0 0 0 4 0 0 21
1 2 3 0 0 3 0 0 0 0 0 0 0 6
2 3 4 0 0 0 0 0 0 0 0 0 0 4
3 4 3 0 0 0 0 5 0 0 4 0 5 17
4 5 3 0 0 0 0 0 0 0 0 0 0 3
5 6 5 0 0 0 0 0 0 5 0 0 0 10
6 7 4 0 0 0 2 5 3 4 4 0 0 22
7 8 1 0 0 0 4 5 0 0 0 4 0 14
8 9 5 0 0 0 4 5 0 0 4 5 0 23
9 10 3 2 0 0 0 4 0 0 0 0 0 9
10 11 2 0 4 0 0 3 3 0 4 2 0 18
11 12 5 0 0 0 4 5 0 0 5 2 0 21
12 13 5 4 0 0 2 0 0 0 3 0 0 14
13 14 5 4 0 0 5 0 0 0 0 0 0 14
14 15 5 0 0 0 3 0 0 0 0 5 5 18
15 16 5 0 0 0 0 0 0 0 4 0 0 9
16 17 3 0 0 4 0 0 0 0 0 0 0 7
17 18 4 0 0 0 0 0 0 0 0 0 0 4
18 19 5 3 0 0 4 0 0 0 0 0 0 12
19 20 4 0 0 0 0 0 0 0 0 0 0 4
20 21 1 0 0 3 3 0 0 0 0 0 0 7
21 22 4 0 0 0 3 5 5 0 5 4 0 26
22 23 4 0 0 0 4 3 0 0 5 0 0 16
23 24 3 0 0 4 0 0 0 0 0 3 0 10
I've tried sample and arange, but with bad results.
ran1 = df.sample(frac=0.2, replace=False, random_state=1)
ran2 = df.sample(frac=0.2, replace=False, random_state=1)
ran3 = df.sample(frac=0.2, replace=False, random_state=1)
ran4 = df.sample(frac=0.2, replace=False, random_state=1)
ran5 = df.sample(frac=0.2, replace=False, random_state=1)
print(ran1, '\n')
print(ran2, '\n')
print(ran3, '\n')
print(ran4, '\n')
print(ran5, '\n')
This turned out to be 5 exact same DataFrames.
movie_id 1 2 4 5 6 7 8 9 10 11 12 borda
13 14 5 4 0 0 5 0 0 0 0 0 0 14
18 19 5 3 0 0 4 0 0 0 0 0 0 12
3 4 3 0 0 0 0 5 0 0 4 0 5 17
14 15 5 0 0 0 3 0 0 0 0 5 5 18
20 21 1 0 0 3 3 0 0 0 0 0 0 7
Also I've tried :
g = df.groupby(['movie_id'])
h = np.arange(g.ngroups)
np.random.shuffle(h)
df[g.ngroup().isin(h[:6])]
The output :
movie_id 1 2 4 5 6 7 8 9 10 11 12 borda
4 5 3 0 0 0 0 0 0 0 0 0 0 3
6 7 4 0 0 0 2 5 3 4 4 0 0 22
7 8 1 0 0 0 4 5 0 0 0 4 0 14
16 17 3 0 0 4 0 0 0 0 0 0 0 7
17 18 4 0 0 0 0 0 0 0 0 0 0 4
18 19 5 3 0 0 4 0 0 0 0 0 0 12
But there's still only one smaller group, other datas from df aren't grouped.
I'm expecting the smaller groups to be split evenly by using percentage. And the whole df should be split into groups.
Use np.array_split
shuffled = df.sample(frac=1)
result = np.array_split(shuffled, 5)
df.sample(frac=1) shuffle the rows of df. Then use np.array_split split it into parts that have equal size.
It gives you:
for part in result:
print(part,'\n')
movie_id 1 2 4 5 6 7 8 9 10 11 12 borda
5 6 5 0 0 0 0 0 0 5 0 0 0 10
4 5 3 0 0 0 0 0 0 0 0 0 0 3
7 8 1 0 0 0 4 5 0 0 0 4 0 14
16 17 3 0 0 4 0 0 0 0 0 0 0 7
22 23 4 0 0 0 4 3 0 0 5 0 0 16
movie_id 1 2 4 5 6 7 8 9 10 11 12 borda
13 14 5 4 0 0 5 0 0 0 0 0 0 14
14 15 5 0 0 0 3 0 0 0 0 5 5 18
21 22 4 0 0 0 3 5 5 0 5 4 0 26
1 2 3 0 0 3 0 0 0 0 0 0 0 6
20 21 1 0 0 3 3 0 0 0 0 0 0 7
movie_id 1 2 4 5 6 7 8 9 10 11 12 borda
10 11 2 0 4 0 0 3 3 0 4 2 0 18
9 10 3 2 0 0 0 4 0 0 0 0 0 9
11 12 5 0 0 0 4 5 0 0 5 2 0 21
8 9 5 0 0 0 4 5 0 0 4 5 0 23
12 13 5 4 0 0 2 0 0 0 3 0 0 14
movie_id 1 2 4 5 6 7 8 9 10 11 12 borda
18 19 5 3 0 0 4 0 0 0 0 0 0 12
3 4 3 0 0 0 0 5 0 0 4 0 5 17
0 1 5 4 0 4 4 0 0 0 4 0 0 21
23 24 3 0 0 4 0 0 0 0 0 3 0 10
6 7 4 0 0 0 2 5 3 4 4 0 0 22
movie_id 1 2 4 5 6 7 8 9 10 11 12 borda
17 18 4 0 0 0 0 0 0 0 0 0 0 4
2 3 4 0 0 0 0 0 0 0 0 0 0 4
15 16 5 0 0 0 0 0 0 0 4 0 0 9
19 20 4 0 0 0 0 0 0 0 0 0 0 4
A simple demo:
df = pd.DataFrame({"movie_id": np.arange(1, 25),
"borda": np.random.randint(1, 25, size=(24,))})
n_split = 5
# the indices used to select parts from dataframe
ixs = np.arange(df.shape[0])
np.random.shuffle(ixs)
# np.split cannot work when there is no equal division
# so we need to find out the split points ourself
# we need (n_split-1) split points
split_points = [i*df.shape[0]//n_split for i in range(1, n_split)]
# use these indices to select the part we want
for ix in np.split(ixs, split_points):
print(df.iloc[ix])
The result:
borda movie_id
8 3 9
10 2 11
22 14 23
7 14 8
borda movie_id
0 16 1
20 4 21
17 15 18
15 1 16
6 6 7
borda movie_id
9 9 10
19 4 20
5 1 6
16 23 17
21 20 22
borda movie_id
11 24 12
23 5 24
1 22 2
12 7 13
18 15 19
borda movie_id
3 11 4
14 10 15
2 6 3
4 7 5
13 21 14
IIUC, you can do this:
frames={}
for e,i in enumerate(np.split(df,6)):
frames.update([('df_'+str(e+1),pd.DataFrame(np.random.permutation(i),columns=df.columns))])
print(frames['df_1'])
movie_id 1 2 4 5 6 7 8 9 10 11 12 borda
0 4 3 0 0 0 0 5 0 0 4 0 5 17
1 3 4 0 0 0 0 0 0 0 0 0 0 4
2 2 3 0 0 3 0 0 0 0 0 0 0 6
3 1 5 4 0 4 4 0 0 0 4 0 0 21
Explanation: np.split(df,6) splits the df to 6 equal size.
pd.DataFrame(np.random.permutation(i),columns=df.columns) randomly reshapes the rows so creating a dataframe with this information and storing in a dictionary names frames.
Finally print the dictionary by calling each keys, values as dataframe will be returned. you can try print frames['df_1'] , frames['df_2'] , etc. It will return random permutations of a split of the dataframe.

Dataframe from all possible combinations of values of given categories

I have
{"A":[0,1], "B":[4,5], "C":[0,1], "D":[0,1]}
what I want it
A B C D
0 4 0 0
0 4 0 1
0 4 1 0
0 4 1 1
1 4 0 1
...and so on. Basically all the combinations of values for each of the categories.
What would be the best way to achieve this?
If x is your dict:
>>> pandas.DataFrame(list(itertools.product(*x.values())), columns=x.keys())
A C B D
0 0 0 4 0
1 0 0 4 1
2 0 0 5 0
3 0 0 5 1
4 0 1 4 0
5 0 1 4 1
6 0 1 5 0
7 0 1 5 1
8 1 0 4 0
9 1 0 4 1
10 1 0 5 0
11 1 0 5 1
12 1 1 4 0
13 1 1 4 1
14 1 1 5 0
15 1 1 5 1
If you want the columns in a particular order you'll need to switch them afterwards (with, e.g., df[["A", "B", "C", "D"]].

Categories