I have a DataFrame which is sorted by an integer column v1:
v1
0 1
1 5
2 6
3 12
4 15
5 23
6 24
7 25
8 33
I want to group the values in v1 like this: if value - prev_value < 5, they belong to the same group.
For that, I want to assign an increasing number to each group.
So I want to create another column, v1_group, which will have the output:
v1 v1_group
0 1 1
1 5 1
2 6 1
3 12 2 # 12 - 6 >= 5, new group
4 15 2
5 23 3
6 24 3
7 25 3
8 33 4
I need to do the same task with a datetime column: group values if value - prev_value < timedelta.
I know I can solve this using a standard for loop. Is there a better pandas way?
IIUC, flag each row whose difference from the previous value is at least 5 (those rows start a new group), then take a cumulative sum of the flags:
df['v1_group'] = df.v1.diff().ge(5).cumsum() + 1
Output:
v1 v1_group
0 1 1
1 5 1
2 6 1
3 12 2
4 15 2
5 23 3
6 24 3
7 25 3
8 33 4
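The same pattern covers the datetime case: compare the diff against a Timedelta instead of an integer. A minimal sketch with made-up timestamps and a 5-minute threshold:
import pandas as pd

df = pd.DataFrame({'t': pd.to_datetime(['2021-01-01 00:00', '2021-01-01 00:04',
                                        '2021-01-01 00:05', '2021-01-01 00:11',
                                        '2021-01-01 00:14'])})

# start a new group whenever the gap from the previous timestamp is >= 5 minutes
df['t_group'] = df['t'].diff().ge(pd.Timedelta(minutes=5)).cumsum() + 1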
I have a dataframe looking like this:
Weekday Day_in_Month Starting_hour Ending_hour Power
3 1 1 3 35
3 1 3 7 15
4 2 22 2 5
.
.
.
I want to duplicate every row until its Starting_hour reaches the Ending_hour.
-> All other values of the row should stay the same, but Starting_hour should increase by 1 in every new row.
The final dataframe should look like the following:
Weekday Day_in_Month Starting_hour Ending_hour Power
3 1 1 3 35
3 1 2 3 35
3 1 3 3 35
3 1 3 7 15
3 1 4 7 15
3 1 5 7 15
3 1 6 7 15
3 1 7 7 15
4 2 22 2 5
4 2 23 2 5
4 2 24 2 5
4 2 1 2 5
4 2 2 2 5
I appreciate any ideas on it, thanks!
Use Index.repeat with the difference of the two hour columns to repeat the rows via DataFrame.loc, then add a per-copy counter to Starting_hour with GroupBy.cumcount:
# repeat each row (Ending_hour - Starting_hour + 1) times
df1 = df.loc[df.index.repeat(df['Ending_hour'].sub(df['Starting_hour']).add(1))]
# number the copies of each source row 0, 1, 2, ... and shift the hour
df1['Starting_hour'] += df1.groupby(level=0).cumcount()
df1 = df1.reset_index(drop=True)
print(df1)
EDIT: If Starting_hour can be greater than Ending_hour (the interval wraps past midnight), add 24 to Ending_hour for those rows; then, in the last step, subtract 1, take modulo 24, and add 1 so the hours stay in the 1-24 range:
# rows whose interval wraps past midnight
m = df['Starting_hour'].gt(df['Ending_hour'])
e = df['Ending_hour'].mask(m, df['Ending_hour'].add(24))
df1 = df.loc[df.index.repeat(e.sub(df['Starting_hour']).add(1))]
# shift the hour, then wrap back into the 1-24 range
df1['Starting_hour'] = (df1['Starting_hour'].add(df1.groupby(level=0).cumcount())
                                            .sub(1).mod(24).add(1))
df1 = df1.reset_index(drop=True)
print(df1)
Weekday Day_in_Month Starting_hour Ending_hour Power
0 3 1 1 3 35
1 3 1 2 3 35
2 3 1 3 3 35
3 3 1 3 7 15
4 3 1 4 7 15
5 3 1 5 7 15
6 3 1 6 7 15
7 3 1 7 7 15
8 4 2 22 2 5
9 4 2 23 2 5
10 4 2 24 2 5
11 4 2 1 2 5
12 4 2 2 2 5
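Note why groupby(level=0).cumcount() works here: Index.repeat keeps the original index label on every duplicated row, so grouping by the index numbers the copies of each source row 0, 1, 2, ..., which is exactly the offset to add to Starting_hour.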
I have two dataframes that contain time series data.
DataFrame A contains data with timestep 1: its index increases by 1 each row.
DataFrame B contains data with timestep n: its index increases by n each row.
I wish to do the following:
Add a column to DataFrame A and fill it from DataFrame B, so that every row of A takes the tval of the most recent row in B whose index is less than or equal to A's index; all rows of A falling between two consecutive B indexes share the same value.
I will illustrate this as below:
A:
id val1
0 2
1 3
2 4
3 1
4 6
5 23
6 2
7 12
8 56
9 34
10 90
...
B:
id tval
0 1
3 5
6 9
9 34
12 3434
...
Now, my result should like the following:
A:
id val1 tval
0 2 1
1 3 1
2 4 1
3 1 5
4 6 5
5 23 5
6 2 9
7 12 9
8 56 9
9 34 34
10 90 34
...
I would like to automate this for any n.
Use merge_asof:
df = pd.merge_asof(A, B, left_index=True, right_index=True)
print(df)
val1 tval
id
0 2 1
1 3 1
2 4 1
3 1 5
4 6 5
5 23 5
6 2 9
7 12 9
8 56 9
9 34 34
10 90 34
If id is a column:
df = pd.merge_asof(A, B, on='id')
print(df)
id val1 tval
0 0 2 1
1 1 3 1
2 2 4 1
3 3 1 5
4 4 6 5
5 5 23 5
6 6 2 9
7 7 12 9
8 8 56 9
9 9 34 34
10 10 90 34
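For reference, merge_asof with the default direction='backward' matches each row of A to the last row of B whose key is less than or equal to A's key, so it handles any step n without changes; both keys just need to be sorted ascending. A minimal self-contained sketch with the sample data from the question:
import pandas as pd

A = pd.DataFrame({'val1': [2, 3, 4, 1, 6, 23, 2, 12, 56, 34, 90]})       # step 1
B = pd.DataFrame({'tval': [1, 5, 9, 34, 3434]}, index=[0, 3, 6, 9, 12])  # step 3

# each row of A picks up the tval of the most recent B index <= its own index
out = pd.merge_asof(A, B, left_index=True, right_index=True)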
I'm new to programming and Pandas, so please bear with my explanation.
I basically have two columns (DM1_ID, DM2_ID) and I need to create a new column ('NewID') based on their values. Essentially I'm assigning one new ID to both columns: take the value in the first column and put it into the 'NewID' column.
While doing that, I also need to track DM2_ID: whenever that ID later appears in DM1_ID, it must get the same NewID.
For example, index 0 has DM1_ID 1 and DM2_ID 6, so I put 1 as the NewID for both IDs. When DM1_ID becomes 6 (indexes 12-16), the NewID must be 1 regardless of DM2_ID, since 1 and 6 already share NewID 1. Those DM2_ID values are kept for later use as well: at index 15 the pair is (6, 45), and since 1 and 6 already have NewID 1, 45 also gets NewID 1, so when 45 shows up in DM1_ID (index 21), its NewID is 1 too.
#I have a large table like this
DM1_ID DM2_ID
0 1 6
1 1 7
2 1 15
3 2 5
4 2 10
5 3 21
6 3 28
7 3 32
8 3 35
9 4 39
10 5 2
11 5 10
12 6 1
13 6 7
14 6 15
15 6 45
16 6 55
17 7 1
18 7 6
19 7 15
20 10 75
21 45 120
22 45 10
23 10 27
24 10 28
25 2 335
#I need to create this table
DM1_ID DM2_ID abc
0 1 6 1
1 1 7 1
2 1 15 1
3 2 5 2
4 2 10 2
5 3 21 3
6 3 28 3
7 3 32 3
8 3 35 3
9 4 39 4
10 5 2 2
11 5 10 2
12 6 1 1
13 6 7 1
14 6 15 1
15 6 45 1
16 6 55 1
17 7 1 1
18 7 6 1
19 7 15 1
20 10 75 2
21 45 120 1
22 45 10 2
23 10 27 2
24 10 28 2
25 2 335 2
Any help would be appreciated. Thanks.
One way to achieve your goal is to persist the IDs first. You can then use this persisted mapping (a table or dictionary) to assign unique IDs once the conditions are met. The example below uses a dictionary, but you could alternatively persist the assigned IDs in a database or a JSON file:
# pair up the two ID columns per row
df['pairs'] = df.apply(lambda x: [x['DM1_ID'], x['DM2_ID']], axis=1)
pairs = df['pairs'].tolist()

u = {}   # groups as of the previous pair
u_ = {}  # groups being built for the current pair
for p in pairs:
    if u:
        if not u_:
            u_ = u.copy()
        else:
            u = u_.copy()
        # merge the pair into every group it overlaps with
        for k in list(u.keys()):
            if any(x in u[k] for x in p):
                u_.update({k: list(set(u[k] + p))})
        # if the pair touched no existing group, open a new one
        vals = [j for i in list(u.values()) for j in i]
        if u == u_ and not any(x in vals for x in p):
            n = max(list(u_.keys())) + 1
            u_[n] = p
    else:
        # the first pair seeds group 1
        u[1] = p

u_
Output:
{1: [1, 6, 7, 45, 15, 55, 120],
2: [75, 2, 10, 5],
3: [32, 35, 3, 21, 28],
4: [4, 39]}
Now let's apply a function that assigns the new ID to each row based on the dictionary created in the previous step:
f = lambda x: next(k for k,v in u_.items() if any(i in v for i in x))
df['new_ID'] = df['pairs'].apply(f)
df.drop('pairs', axis=1, inplace=True)
df
Output:
DM1_ID DM2_ID new_ID
0 1 6 1
1 1 7 1
2 1 15 1
3 2 5 2
4 2 10 2
5 3 21 3
6 3 28 3
7 3 32 3
8 3 35 3
9 4 39 4
10 5 2 2
11 5 10 2
12 6 1 1
13 6 7 1
14 6 15 1
15 6 45 1
16 6 55 1
17 7 1 1
18 7 6 1
19 7 15 1
20 10 75 2
21 45 120 1
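Alternatively, this is a connected-components problem, so if pulling in networkx is an option, a much shorter sketch (operating on the question's df; groups come out numbered in order of first appearance because nodes are inserted in row order):
import networkx as nx

# build an undirected graph from the ID pairs: linked IDs belong to one group
G = nx.from_pandas_edgelist(df, source='DM1_ID', target='DM2_ID')

# number each connected component and map every member ID to that number
group = {}
for i, comp in enumerate(nx.connected_components(G), start=1):
    group.update(dict.fromkeys(comp, i))

df['new_ID'] = df['DM1_ID'].map(group)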
I have a frequency column in my dataframe.
frequency
1
1
1
1
2
2
3
4
5
5
5
5
I'd like to calculate its cumulative sum while ensuring that all rows with the same frequency value get that group's maximum cumulative sum, like so:
frequency cumsum
1 35
1 35
1 35
1 35
2 31
2 31
3 27
4 24
5 20
5 20
5 20
5 20
I can do it in Google BigQuery with this syntax:
select
frequency,
sum(frequency) over (order by frequency desc) as cumsum
from `project1.dataset1.table1`
I've tried this in Python:
df['cumsum'] = df['frequency'].sort_values(ascending=False).cumsum()
Which gives me this
frequency cumsum
1 5
1 4
1 3
1 2
2 31
2 29
3 27
4 24
5 20
5 15
5 10
5 5
So I tried adding this syntax:
df['max_cumsum'] = df['frequency'].apply(lambda x: df[df['frequency'] == x]['cumsum'].max())
but it runs forever. I'm clearly doing something wrong here. Please throw me a lifeline.
You can try
df['New'] = df.groupby('frequency')['cumsum'].transform('max')
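This reuses the cumsum column from your attempt. A self-contained sketch of both steps together (frequencies copied from the question):
import pandas as pd

df = pd.DataFrame({'frequency': [1, 1, 1, 1, 2, 2, 3, 4, 5, 5, 5, 5]})

# running total in descending frequency order, aligned back to the rows by index
df['cumsum'] = df['frequency'].sort_values(ascending=False).cumsum()

# give every row of a frequency group that group's maximum running total
df['cumsum'] = df.groupby('frequency')['cumsum'].transform('max')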
Let's try map:
# per-frequency totals, cumulated from the highest frequency down, then mapped back
df['cumsum'] = df['frequency'].map(df['frequency'].groupby(df['frequency']).sum()
                                                  .sort_index(ascending=False)
                                                  .cumsum())
Output:
frequency cumsum
0 1 35
1 1 35
2 1 35
3 1 35
4 2 31
5 2 31
6 3 27
7 4 24
8 5 20
9 5 20
10 5 20
11 5 20
I am dealing with the following dataframe:
p q
0 11 2
1 11 2
2 11 2
3 11 3
4 11 3
5 12 2
6 12 2
7 13 2
8 13 2
I want to create a new column, say s, which starts at 0 and increases. This new column is based on the p column: whenever p changes, s should change too.
For the first 4 rows, p = 11, so s should have the value 0 for those rows, and so on...
Below is the expected df:
s p q
0 0 11 2
1 0 11 2
2 0 11 2
3 0 11 2
4 1 11 4
5 1 11 4
6 1 11 4
7 1 11 4
8 2 12 2
9 2 12 2
10 2 12 2
11 3 12 3
12 3 12 3
You need diff with cumsum (subtract one so the id starts from 0, as in your expected output):
df["s"] = (df["p"].diff() != 0).cumsum() - 1
df
Update: if you want to take both p and q into consideration, use an OR condition on the two columns' differences, so that whichever column changes, the final id increases.
df['s'] = ((df['p'].diff() != 0) | (df['q'].diff() != 0)).cumsum() - 1
df
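A possible alternative for this data is GroupBy.ngroup, which numbers the (p, q) groups in order of first appearance. Note it only matches the diff-based id while equal pairs sit in contiguous blocks; a pair that reappears later gets its old number back rather than a new one:
df['s'] = df.groupby(['p', 'q'], sort=False).ngroup()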