I have a DataFrame which is sorted by an integer column v1:
v1
0 1
1 5
2 6
3 12
4 15
5 23
6 24
7 25
8 33
I want to group the values in v1 like this: if value - prev_value < 5, they belong to the same group.
For that, I want to assign an increasing number to each group.
So I want to create another column, v1_group, which will have the output:
v1 v1_group
0 1 1
1 5 1
2 6 1
3 12 2 # 12 - 6 >= 5, new group
4 15 2
5 23 3
6 24 3
7 25 3
8 33 4
I need to do the same task with a datetime column: group values if value - prev_value < timedelta.
I know I can solve this using a standard for loop. Is there a better pandas way?
IIUC, flag each row whose difference from the previous value is at least 5 (those rows start a new group), then take a cumulative sum of the flags:
df['v1_group'] = df.v1.diff().ge(5).cumsum() + 1
Output:
v1 v1_group
0 1 1
1 5 1
2 6 1
3 12 2
4 15 2
5 23 3
6 24 3
7 25 3
8 33 4
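The same pattern covers the datetime case: compare the diff against a Timedelta instead of an integer. A minimal sketch with made-up timestamps and a 5-minute threshold:
import pandas as pd

df = pd.DataFrame({'t': pd.to_datetime(['2021-01-01 00:00', '2021-01-01 00:04',
                                        '2021-01-01 00:05', '2021-01-01 00:11',
                                        '2021-01-01 00:14'])})

# start a new group whenever the gap from the previous timestamp is >= 5 minutes
df['t_group'] = df['t'].diff().ge(pd.Timedelta(minutes=5)).cumsum() + 1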
I have a dataframe looking like this:
Weekday Day_in_Month Starting_hour Ending_hour Power
3 1 1 3 35
3 1 3 7 15
4 2 22 2 5
.
.
.
I want to duplicate every row until its Starting_hour reaches the Ending_hour.
-> All other values of the row should stay the same, but Starting_hour should increase by 1 in every new row.
The final dataframe should look like the following:
Weekday Day_in_Month Starting_hour Ending_hour Power
3 1 1 3 35
3 1 2 3 35
3 1 3 3 35
3 1 3 7 15
3 1 4 7 15
3 1 5 7 15
3 1 6 7 15
3 1 7 7 15
4 2 22 2 5
4 2 23 2 5
4 2 24 2 5
4 2 1 2 5
4 2 2 2 5
I appreciate any ideas on it, thanks!
Use Index.repeat with the difference of the two hour columns to repeat the rows via DataFrame.loc, then add a per-copy counter to Starting_hour with GroupBy.cumcount:
# repeat each row (Ending_hour - Starting_hour + 1) times
df1 = df.loc[df.index.repeat(df['Ending_hour'].sub(df['Starting_hour']).add(1))]
# number the copies of each source row 0, 1, 2, ... and shift the hour
df1['Starting_hour'] += df1.groupby(level=0).cumcount()
df1 = df1.reset_index(drop=True)
print(df1)
EDIT: If Starting_hour can be greater than Ending_hour (the interval wraps past midnight), add 24 to Ending_hour for those rows; then, in the last step, subtract 1, take modulo 24, and add 1 so the hours stay in the 1-24 range:
# rows whose interval wraps past midnight
m = df['Starting_hour'].gt(df['Ending_hour'])
e = df['Ending_hour'].mask(m, df['Ending_hour'].add(24))
df1 = df.loc[df.index.repeat(e.sub(df['Starting_hour']).add(1))]
# shift the hour, then wrap back into the 1-24 range
df1['Starting_hour'] = (df1['Starting_hour'].add(df1.groupby(level=0).cumcount())
                                            .sub(1).mod(24).add(1))
df1 = df1.reset_index(drop=True)
print(df1)
Weekday Day_in_Month Starting_hour Ending_hour Power
0 3 1 1 3 35
1 3 1 2 3 35
2 3 1 3 3 35
3 3 1 3 7 15
4 3 1 4 7 15
5 3 1 5 7 15
6 3 1 6 7 15
7 3 1 7 7 15
8 4 2 22 2 5
9 4 2 23 2 5
10 4 2 24 2 5
11 4 2 1 2 5
12 4 2 2 2 5
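Note why groupby(level=0).cumcount() works here: Index.repeat keeps the original index label on every duplicated row, so grouping by the index numbers the copies of each source row 0, 1, 2, ..., which is exactly the offset to add to Starting_hour.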
I have two dataframes that contain time series data.
DataFrame A contains data with timestep 1: its index increases by 1 each row.
DataFrame B contains data with timestep n: its index increases by n each row.
I wish to do the following:
Add a column to DataFrame A and fill it from DataFrame B, so that every row of A takes the tval of the most recent row in B whose index is less than or equal to A's index; all rows of A falling between two consecutive B indexes share the same value.
I will illustrate this as below:
A:
id val1
0 2
1 3
2 4
3 1
4 6
5 23
6 2
7 12
8 56
9 34
10 90
...
B:
id tval
0 1
3 5
6 9
9 34
12 3434
...
Now, my result should like the following:
A:
id val1 tval
0 2 1
1 3 1
2 4 1
3 1 5
4 6 5
5 23 5
6 2 9
7 12 9
8 56 9
9 34 34
10 90 34
...
I would like to automate this for any n.
Use merge_asof:
df = pd.merge_asof(A, B, left_index=True, right_index=True)
print(df)
val1 tval
id
0 2 1
1 3 1
2 4 1
3 1 5
4 6 5
5 23 5
6 2 9
7 12 9
8 56 9
9 34 34
10 90 34
If id is a column:
df = pd.merge_asof(A, B, on='id')
print(df)
id val1 tval
0 0 2 1
1 1 3 1
2 2 4 1
3 3 1 5
4 4 6 5
5 5 23 5
6 6 2 9
7 7 12 9
8 8 56 9
9 9 34 34
10 10 90 34
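For reference, merge_asof with the default direction='backward' matches each row of A to the last row of B whose key is less than or equal to A's key, so it handles any step n without changes; both keys just need to be sorted ascending. A minimal self-contained sketch with the sample data from the question:
import pandas as pd

A = pd.DataFrame({'val1': [2, 3, 4, 1, 6, 23, 2, 12, 56, 34, 90]})       # step 1
B = pd.DataFrame({'tval': [1, 5, 9, 34, 3434]}, index=[0, 3, 6, 9, 12])  # step 3

# each row of A picks up the tval of the most recent B index <= its own index
out = pd.merge_asof(A, B, left_index=True, right_index=True)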
I'm new to programming and Pandas, so please bear with my explanation.
I basically have two columns (DM1_ID, DM2_ID) and I need to create a new column ('NewID') based on their values. Essentially I'm assigning one new ID to both columns: take the value in the first column and put it into the 'NewID' column.
While doing that, I also need to track DM2_ID: whenever that ID later appears in DM1_ID, it must get the same NewID.
For example, index 0 has DM1_ID 1 and DM2_ID 6, so I put 1 as the NewID for both IDs. When DM1_ID becomes 6 (indexes 12-16), the NewID must be 1 regardless of DM2_ID, since 1 and 6 already share NewID 1. Those DM2_ID values are kept for later use as well: at index 15 the pair is (6, 45), and since 1 and 6 already have NewID 1, 45 also gets NewID 1, so when 45 shows up in DM1_ID (index 21), its NewID is 1 too.
#I have a large table like this
DM1_ID DM2_ID
0 1 6
1 1 7
2 1 15
3 2 5
4 2 10
5 3 21
6 3 28
7 3 32
8 3 35
9 4 39
10 5 2
11 5 10
12 6 1
13 6 7
14 6 15
15 6 45
16 6 55
17 7 1
18 7 6
19 7 15
20 10 75
21 45 120
22 45 10
23 10 27
24 10 28
25 2 335
#I need to create this table
DM1_ID DM2_ID abc
0 1 6 1
1 1 7 1
2 1 15 1
3 2 5 2
4 2 10 2
5 3 21 3
6 3 28 3
7 3 32 3
8 3 35 3
9 4 39 4
10 5 2 2
11 5 10 2
12 6 1 1
13 6 7 1
14 6 15 1
15 6 45 1
16 6 55 1
17 7 1 1
18 7 6 1
19 7 15 1
20 10 75 2
21 45 120 1
22 45 10 2
23 10 27 2
24 10 28 2
25 2 335 2
Any help would be appreciated. Thanks.
One way to achieve your goal is to persist the IDs first. You can then use this persisted mapping (a table or dictionary) to assign unique IDs once the conditions are met. The example below uses a dictionary, but you could alternatively persist the assigned IDs in a database or a JSON file:
# pair up the two ID columns per row
df['pairs'] = df.apply(lambda x: [x['DM1_ID'], x['DM2_ID']], axis=1)
pairs = df['pairs'].tolist()

u = {}   # groups as of the previous pair
u_ = {}  # groups being built for the current pair
for p in pairs:
    if u:
        if not u_:
            u_ = u.copy()
        else:
            u = u_.copy()
        # merge the pair into every group it overlaps with
        for k in list(u.keys()):
            if any(x in u[k] for x in p):
                u_.update({k: list(set(u[k] + p))})
        # if the pair touched no existing group, open a new one
        vals = [j for i in list(u.values()) for j in i]
        if u == u_ and not any(x in vals for x in p):
            n = max(list(u_.keys())) + 1
            u_[n] = p
    else:
        # the first pair seeds group 1
        u[1] = p

u_
Output:
{1: [1, 6, 7, 45, 15, 55, 120],
2: [75, 2, 10, 5],
3: [32, 35, 3, 21, 28],
4: [4, 39]}
Now let's apply a function that assigns the new ID to each row based on the dictionary created in the previous step:
f = lambda x: next(k for k,v in u_.items() if any(i in v for i in x))
df['new_ID'] = df['pairs'].apply(f)
df.drop('pairs', axis=1, inplace=True)
df
Output:
DM1_ID DM2_ID new_ID
0 1 6 1
1 1 7 1
2 1 15 1
3 2 5 2
4 2 10 2
5 3 21 3
6 3 28 3
7 3 32 3
8 3 35 3
9 4 39 4
10 5 2 2
11 5 10 2
12 6 1 1
13 6 7 1
14 6 15 1
15 6 45 1
16 6 55 1
17 7 1 1
18 7 6 1
19 7 15 1
20 10 75 2
21 45 120 1
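Alternatively, this is a connected-components problem, so if pulling in networkx is an option, a much shorter sketch (operating on the question's df; groups come out numbered in order of first appearance because nodes are inserted in row order):
import networkx as nx

# build an undirected graph from the ID pairs: linked IDs belong to one group
G = nx.from_pandas_edgelist(df, source='DM1_ID', target='DM2_ID')

# number each connected component and map every member ID to that number
group = {}
for i, comp in enumerate(nx.connected_components(G), start=1):
    group.update(dict.fromkeys(comp, i))

df['new_ID'] = df['DM1_ID'].map(group)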
I have a frequency column in my dataframe.
frequency
1
1
1
1
2
2
3
4
5
5
5
5
I'd like to calculate its cumulative sum while ensuring that all rows with the same frequency value get that group's maximum cumulative sum, like so:
frequency cumsum
1 35
1 35
1 35
1 35
2 31
2 31
3 27
4 24
5 20
5 20
5 20
5 20
I can do it in Google BigQuery with this syntax:
select
frequency,
sum(frequency) over (order by frequency desc) as cumsum
from `project1.dataset1.table1`
I've tried this in Python:
df['cumsum'] = df['frequency'].sort_values(ascending=False).cumsum()
Which gives me this
frequency cumsum
1 5
1 4
1 3
1 2
2 31
2 29
3 27
4 24
5 20
5 15
5 10
5 5
So I tried adding this syntax:
df['max_cumsum'] = df['frequency'].apply(lambda x: df[df['frequency'] == x]['cumsum'].max())
but it runs forever. I'm clearly doing something wrong here. Please throw me a lifeline.
You can try
df['New'] = df.groupby('frequency')['cumsum'].transform('max')
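This reuses the cumsum column from your attempt. A self-contained sketch of both steps together (frequencies copied from the question):
import pandas as pd

df = pd.DataFrame({'frequency': [1, 1, 1, 1, 2, 2, 3, 4, 5, 5, 5, 5]})

# running total in descending frequency order, aligned back to the rows by index
df['cumsum'] = df['frequency'].sort_values(ascending=False).cumsum()

# give every row of a frequency group that group's maximum running total
df['cumsum'] = df.groupby('frequency')['cumsum'].transform('max')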
Let's try map:
# per-frequency totals, cumulated from the highest frequency down, then mapped back
df['cumsum'] = df['frequency'].map(df['frequency'].groupby(df['frequency']).sum()
                                                  .sort_index(ascending=False)
                                                  .cumsum())
Output:
frequency cumsum
0 1 35
1 1 35
2 1 35
3 1 35
4 2 31
5 2 31
6 3 27
7 4 24
8 5 20
9 5 20
10 5 20
11 5 20
I am dealing with the following dataframe:
p q
0 11 2
1 11 2
2 11 2
3 11 3
4 11 3
5 12 2
6 12 2
7 13 2
8 13 2
I want to create a new column, say s, which starts at 0 and increases. This new column is based on the p column: whenever p changes, s should change too.
For the first 4 rows, p = 11, so s should have the value 0 for those rows, and so on...
Below is the expected df:
s p q
0 0 11 2
1 0 11 2
2 0 11 2
3 0 11 2
4 1 11 4
5 1 11 4
6 1 11 4
7 1 11 4
8 2 12 2
9 2 12 2
10 2 12 2
11 3 12 3
12 3 12 3
You need diff with cumsum (subtract one so the id starts from 0, as in your expected output):
df["s"] = (df["p"].diff() != 0).cumsum() - 1
df
Update: if you want to take both p and q into consideration, use an OR condition on the two columns' differences, so that whichever column changes, the final id increases.
df['s'] = ((df['p'].diff() != 0) | (df['q'].diff() != 0)).cumsum() - 1
df
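A possible alternative for this data is GroupBy.ngroup, which numbers the (p, q) groups in order of first appearance. Note it only matches the diff-based id while equal pairs sit in contiguous blocks; a pair that reappears later gets its old number back rather than a new one:
df['s'] = df.groupby(['p', 'q'], sort=False).ngroup()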