Reshape excel table with stacked date column - python

I have an Excel file that has different weather stations and the minimum and maximum temperature of every month in a year, like this:
Name       Time     Minimum  Maximum
Station 1  2020-01  -10      2
...        ...      ...      ...
Station 1  2020-12  -5       0
Station 2  2020-01  -8       4
...        ...      ...      ...
Station 2  2020-12  -6       4
Station 3  2020-01  -8       5
...        ...      ...      ...
Station 3  2020-12  -5       5
Station 4  2020-01  -9       5
...        ...      ...      ...
Station 4  2020-12  -11      4
Not quite well versed in Python, but I have been trying to follow the pandas user guide and looking at forums, using different methods to pivot the table with pandas so that the header (with the values below it) becomes like this, but without luck:
Date  Minimum - Station 1  Maximum - Station 1  ...  Minimum - Station 4  Maximum - Station 4
This was my final attempt. It results in the header coming out in the wrong order, along with the first row of values.
import pandas as pd
df = pd.read_excel('input.xlsx')
result = df.pivot_table(index='Time', columns='Name', values=['Minimum', 'Maximum'])
result.columns = ['_'.join(col) for col in result.columns]
result.to_excel('output.xlsx')

Using a pivot makes sense to me too. Is that where your problem is occurring? Then maybe some of the columns in your example data are part of the index of your input table. The most difficult part is getting all the columns sorted and named as in your expected output. This works for me with some sample data:
import numpy as np
import pandas as pd

# Sample data
df = pd.DataFrame({
    "Name": np.repeat([f"Station {i}" for i in range(1, 5)], 12),
    "Time": np.tile([f"2020-{i:02d}" for i in range(1, 13)], 4),
    "Minimum": np.random.randint(-10, 0, size=48),
    "Maximum": np.random.randint(0, 10, size=48),
})

df = df.pivot(index="Time", columns="Name")
df = df.sort_index(axis=1, level=[1, 0], ascending=[True, False])
df = df.set_axis(df.columns.to_flat_index().map(" - ".join), axis=1)
df = df.reset_index(names="Date")
output:
Date Minimum - Station 1 Maximum - Station 1 ... Maximum - Station 3 Minimum - Station 4 Maximum - Station 4
0 2020-01 -9 2 ... 1 -4 5
1 2020-02 -4 5 ... 2 -9 5
2 2020-03 -1 8 ... 2 -7 0
3 2020-04 -9 0 ... 9 -7 7
4 2020-05 -4 6 ... 1 -4 9
5 2020-06 -9 2 ... 6 -2 5
6 2020-07 -9 2 ... 1 -5 5
7 2020-08 -6 2 ... 8 -6 0
8 2020-09 -3 2 ... 2 -10 0
9 2020-10 -2 5 ... 1 -2 4
10 2020-11 -10 7 ... 7 -2 5
11 2020-12 -6 8 ... 3 -2 8
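If your real data lives in input.xlsx as in your attempt, the same steps should work after read_excel; here is a minimal sketch reusing your file names (if the Name column comes from merged Excel cells it may arrive with blanks, hence the optional ffill):
import pandas as pd

df = pd.read_excel("input.xlsx")          # expects columns: Name, Time, Minimum, Maximum
df["Name"] = df["Name"].ffill()           # only needed if merged cells left blank Name rows
result = df.pivot(index="Time", columns="Name")
result = result.sort_index(axis=1, level=[1, 0], ascending=[True, False])
result = result.set_axis(result.columns.to_flat_index().map(" - ".join), axis=1)
result.reset_index(names="Date").to_excel("output.xlsx", index=False)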

Related

joining a table to another table in pandas

I am trying to grab the data from https://www.espn.com/nhl/standings
When I try to grab it, it puts the Florida Panthers one row too high and messes up the data. All the team names need to be shifted down a row. I have tried to mutate the data with
dataset_one = dataset_one.shift(1)
and then joining it with the stats table, but I am getting NaN.
The docs seem to show a lot of ways of joining and merging data with similar column headers, but I'm not sure of the best solution here without a similar column header to join with.
Code:
import pandas as pd
page = pd.read_html('https://www.espn.com/nhl/standings')
dataset_one = page[0] # Team Names
dataset_two = page[1] # Stats
combined_data = dataset_one.join(dataset_two)
print(combined_data)
Output:
FLAFlorida Panthers GP W L OTL ... GF GA DIFF L10 STRK
0 CBJColumbus Blue Jackets 6 5 0 1 ... 22 16 6 5-0-1 W2
1 CARCarolina Hurricanes 10 4 3 3 ... 24 28 -4 4-3-3 L1
2 DALDallas Stars 6 5 1 0 ... 18 10 8 5-1-0 W4
3 TBTampa Bay Lightning 6 4 1 1 ... 23 14 9 4-1-1 L2
4 CHIChicago Blackhawks 6 4 1 1 ... 19 14 5 4-1-1 W1
5 NSHNashville Predators 10 3 4 3 ... 26 31 -5 3-4-3 W1
6 DETDetroit Red Wings 8 4 4 0 ... 20 24 -4 4-4-0 L1
Desired:
GP W L OTL ... GF GA DIFF L10 STRK
0 FLAFlorida Panthers 6 5 0 1 ... 22 16 6 5-0-1 W2
1 CBJColumbus Blue Jackets 10 4 3 3 ... 24 28 -4 4-3-3 L1
2 CARCarolina Hurricanes 6 5 1 0 ... 18 10 8 5-1-0 W4
3 DALDallas Stars 6 4 1 1 ... 23 14 9 4-1-1 L2
4 TBTampa Bay Lightning 6 4 1 1 ... 19 14 5 4-1-1 W1
5 CHIChicago Blackhawks 10 3 4 3 ... 26 31 -5 3-4-3 W1
6 NSHNashville Predators 8 4 4 0 ... 20 24 -4 4-4-0 L1
7 DETDetroit Red Wings 10 2 6 2 ... 20 35 -15 2-6-2 L6
Providing an alternative approach to @Noah's answer: you can first add an extra row, shift the df down by a row, and then assign the header column name as the value at index 0.
import pandas as pd
page = pd.read_html('https://www.espn.com/nhl/standings')
dataset_one = page[0] # Team Names
dataset_two = page[1] # Stats
# Shifting down by one row
dataset_one.loc[max(dataset_one.index) + 1, :] = None
dataset_one = dataset_one.shift(1)
dataset_one.iloc[0] = dataset_one.columns
dataset_one.columns = ['team']
combined_data = dataset_one.join(dataset_two)
Just create the df slightly differently so it knows what the proper header is:
dataset_one = pd.DataFrame(page[0], columns=["Team Name"])
Then when you join it should be aligned properly.
Another alternative is to do the following:
dataset_one = page[0].to_frame(name='Team Name')

How do I sort a whole pandas dataframe by one column, moving the rows grouped in 3s

I have a dataframe with genes (Ensembl IDs and common names), homologs, counts, and totals in groups of three, as such:
Index Zebrafish Homolog Human Homolog Total
0 ENSDARG00000019949 ENSG00000149257
1 serpinh1b SERPINH1
2 2 2 4
3 ENSDARG00000052437 ENSG00000268975
4 mia MIA-RAB4B
5 2 0 2
6 ENSDARG00000057992 ENSG00000134363
7 fstb FST
8 0 3 3
9 ENSDARG00000045580 ENSG00000139329
10 lum LUM
11 15 15 30
etc...
I want to sort these rows by the totals in descending order, such that all the rows are kept intact in groups of 3 in the order shown. The ideal output would be:
Index Zebrafish Homolog Human Homolog Total
0 ENSDARG00000045580 ENSG00000139329
1 lum LUM
2 15 15 30
3 ENSDARG00000019949 ENSG00000149257
4 serpinh1b SERPINH1
5 2 2 4
6 ENSDARG00000057992 ENSG00000134363
7 fstb FST
8 0 3 3
9 ENSDARG00000052437 ENSG00000268975
10 mia MIA-RAB4B
11 2 0 2
etc...
I tried filling in the total for all 3 rows of each group, sorting with dataframe.sort_values(), and removing the previous 2 rows of each clump of 3, but it didn't work properly. Is there a way to group the rows together into clumps of 3, then sort them while maintaining that structure? Thank you in advance for any assistance.
Update #1
If I try to use the code:
df['Total'] = df['Total'].bfill().astype(int)
df = df.sort_values(by='Total', ascending=False)
to add values to the Total for each group of 3 and then sort, it partially works, but scrambles the rows like this:
Index Zebrafish Homolog Human Homolog Total
0 ENSDARG00000045580 ENSG00000139329 30
1 lum LUM 30
2 15 15 30
4 serpinh1b SERPINH1 4
3 ENSDARG00000019949 ENSG00000149257 4
5 2 2 4
8 0 3 3
7 fstb FST 3
6 ENSDARG00000057992 ENSG00000134363 3
9 ENSDARG00000052437 ENSG00000268975 2
11 2 0 2
10 mia MIA-RAB4B 2
etc...
Even worse, if multiple genes have the same total counts, the rows become interchanged between genes, which is confusing.
Is this a dead end? Maybe I should just rewrite the code a different way :(
You need to create a second key to keep the records together when sorting; see below:
import numpy as np

df["Total"] = df["Total"].bfill()                  # copy each group's total (on its 3rd row) back onto all 3 rows
df["helper"] = np.arange(len(df)) // 3             # same helper value for every group of 3 rows
df = df.sort_values(["Total", "helper"], ascending=[False, True], kind="stable")  # descending by Total, groups kept intact
df = df.drop(columns="helper")
It looks like your Total column has missing values, and that actually helps in this case.
Approach 1
import numpy as np

df['Total'] = df['Total'].bfill().astype(int)
df['idx'] = np.arange(len(df)) // 3
df = df.sort_values(by=['Total', 'idx'], ascending=False)
df = df.drop(['idx'], axis=1)
Zebrafish_Homolog Human_Homolog Total
9 ENSDARG00000045580 ENSG00000139329 30
10 lum LUM 30
11 15 15 30
0 ENSDARG00000019949 ENSG00000149257 4
1 serpinh1b SERPINH1 4
2 2 2 4
6 ENSDARG00000057992 ENSG00000134363 3
7 fstb FST 3
8 0 3 3
3 ENSDARG00000052437 ENSG00000268975 2
4 mia MIA-RAB4B 2
5 2 0 2
Note how the index stays the same; if you don't want that, use reset_index():
df = df.reset_index(drop=True)
Approach 2
A more manual way of sorting.
The approach is to sort the index and then loc the df. It looks complicated, but it's just subtracting ints from a list. Note that the process doesn't touch the df until the end, so there should be no speed issue for a larger df.
# Sort by total
df = df.reset_index().sort_values('Total', ascending=False)
# Get the index of the sorted values
uniq_index = df[df['Total'].notnull()]['index'].values
# Create the new index
index = uniq_index.repeat(3)
groups = [-2, -1, 0] * (len(df) // 3)
# Update so everything is in order
new_index = index + groups
# Apply to the dataframe
df = df.loc[new_index]
Zebrafish_Homolog Human_Homolog Total
0 ENSDARG00000045580 ENSG00000139329 NaN
1 lum LUM NaN
2 15 15 30.0
9 ENSDARG00000019949 ENSG00000149257 NaN
10 serpinh1b SERPINH1 NaN
11 2 2 4.0
3 ENSDARG00000057992 ENSG00000134363 NaN
4 fstb FST NaN
5 0 3 3.0
6 ENSDARG00000052437 ENSG00000268975 NaN
7 mia MIA-RAB4B NaN
8 2 0 2.0
12 ENSDARG00000052437 ENSG00000268975 NaN
13 mia MIA-RAB4B NaN
14 2 0 2.0

Compute a ratio conditional on the value in the column of a pandas dataframe

I have a dataframe of the following type
import pandas as pd

df = pd.DataFrame({'Days': [1, 2, 5, 6, 7, 10, 11, 12],
                   'Value': [100.3, 150.5, 237.0, 314.15, 188.0, 413.0, 158.2, 268.0]})
Days Value
0 1 100.3
1 2 150.5
2 5 237.0
3 6 314.15
4 7 188.0
5 10 413.0
6 11 158.2
7 12 268.0
and I would like to add a column '+5Ratio' whose value is the ratio between the Value corresponding to Days+5 and the Value at Days.
For example, in the first row I would have 3.13210368893 = 314.15/100.3, in the second 1.24916943522 = 188.0/150.5, and so on.
Days Value +5Ratio
0 1 100.3 3.13210368893
1 2 150.5 1.24916943522
2 5 237.0 ...
3 6 314.15
4 7 188.0
5 10 413.0
6 11 158.2
7 12 268.0
I'm struggling to find a way to do it using a lambda function.
Could someone help me find a way to solve this problem?
Thanks in advance.
Edit
In the case I am interested in, the "Days" field can vary sparsely from 1 to 18180, for instance.
You can use merge; a benefit of doing it this way is that it handles missing values:
s=df.merge(df.assign(Days=df.Days-5),on='Days')
s.assign(Value=s.Value_y/s.Value_x).drop(['Value_x','Value_y'],axis=1)
Out[359]:
Days Value
0 1 3.132104
1 2 1.249169
2 5 1.742616
3 6 0.503581
4 7 1.425532
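Given the edit that Days can be sparse (up to 18180), a plain index lookup is another option and avoids the merge entirely; a minimal sketch, assuming the Days values are unique:
# Look up the Value at Days+5 through a Days-indexed Series;
# days with no +5 counterpart come back as NaN.
value_by_day = df.set_index('Days')['Value']
df['+5Ratio'] = value_by_day.reindex(df['Days'] + 5).to_numpy() / df['Value'].to_numpy()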
Consider left merging onto a helper dataframe, days_df, containing consecutive daily points, and then shifting by 5 rows for the ratio calculation. Finally, remove the blank day rows:
days_df = pd.DataFrame({'Days':range(min(df.Days), max(df.Days)+1)})
days_df = days_df.merge(df, on='Days', how='left')
print(days_df)
# Days Value
# 0 1 100.30
# 1 2 150.50
# 2 3 NaN
# 3 4 NaN
# 4 5 237.00
# 5 6 314.15
# 6 7 188.00
# 7 8 NaN
# 8 9 NaN
# 9 10 413.00
# 10 11 158.20
# 11 12 268.00
days_df['+5ratio'] = days_df.shift(-5)['Value'] / days_df['Value']
final_df = days_df[days_df['Value'].notnull()].reset_index(drop=True)
print(final_df)
# Days Value +5ratio
# 0 1 100.30 3.132104
# 1 2 150.50 1.249169
# 2 5 237.00 1.742616
# 3 6 314.15 0.503581
# 4 7 188.00 1.425532
# 5 10 413.00 NaN
# 6 11 158.20 NaN
# 7 12 268.00 NaN

Best way to join / merge by range in pandas

I'm frequently using pandas for merge (join) by using a range condition.
For instance if there are 2 dataframes:
A (A_id, A_value)
B (B_id,B_low, B_high, B_name)
which are big and approximately of the same size (let's say 2M records each).
I would like to make an inner join between A and B, so A_value would be between B_low and B_high.
Using SQL syntax that would be:
SELECT *
FROM A,B
WHERE A_value between B_low and B_high
and that would be really easy, short and efficient.
Meanwhile, the only way I found in pandas (without using loops) is to create a dummy column in both tables, join on it (equivalent to a cross join), and then filter out unneeded rows. That sounds heavy and complex:
A['dummy'] = 1
B['dummy'] = 1
Temp = pd.merge(A,B,on='dummy')
Result = Temp[Temp.A_value.between(Temp.B_low,Temp.B_high)]
Another solution I had is applying a search function on B to each A value, using a B[(x >= B.B_low) & (x <= B.B_high)] mask, but it sounds inefficient as well and might require index optimization.
Is there a more elegant and/or efficient way to perform this action?
Setup
Consider the dataframes A and B
A = pd.DataFrame(dict(
    A_id=range(10),
    A_value=range(5, 105, 10)
))

B = pd.DataFrame(dict(
    B_id=range(5),
    B_low=[0, 30, 30, 46, 84],
    B_high=[10, 40, 50, 54, 84]
))
A
A_id A_value
0 0 5
1 1 15
2 2 25
3 3 35
4 4 45
5 5 55
6 6 65
7 7 75
8 8 85
9 9 95
B
B_high B_id B_low
0 10 0 0
1 40 1 30
2 50 2 30
3 54 3 46
4 84 4 84
numpy
The ✌easiest✌ way is to use numpy broadcasting.
We look for every instance of A_value being greater than or equal to B_low while at the same time A_value is less than or equal to B_high.
a = A.A_value.values
bh = B.B_high.values
bl = B.B_low.values
i, j = np.where((a[:, None] >= bl) & (a[:, None] <= bh))
pd.concat([
    A.loc[i, :].reset_index(drop=True),
    B.loc[j, :].reset_index(drop=True)
], axis=1)
A_id A_value B_high B_id B_low
0 0 5 10 0 0
1 3 35 40 1 30
2 3 35 50 2 30
3 4 45 50 2 30
To address the comments and give something akin to a left join, I appended the part of A that doesn't match.
pd.concat([
    A.loc[i, :].reset_index(drop=True),
    B.loc[j, :].reset_index(drop=True)
], axis=1).append(
    A[~np.in1d(np.arange(len(A)), np.unique(i))],
    ignore_index=True, sort=False
)
A_id A_value B_id B_low B_high
0 0 5 0.0 0.0 10.0
1 3 35 1.0 30.0 40.0
2 3 35 2.0 30.0 50.0
3 4 45 2.0 30.0 50.0
4 1 15 NaN NaN NaN
5 2 25 NaN NaN NaN
6 5 55 NaN NaN NaN
7 6 65 NaN NaN NaN
8 7 75 NaN NaN NaN
9 8 85 NaN NaN NaN
10 9 95 NaN NaN NaN
I'm not sure that this is more efficient, but you can use SQL directly (with the sqlite3 module, for instance) together with pandas (inspired by this question), like:
import sqlite3

import numpy as np
import pandas as pd

conn = sqlite3.connect(":memory:")
df2 = pd.DataFrame(np.random.randn(10, 5), columns=["col1", "col2", "col3", "col4", "col5"])
df1 = pd.DataFrame(np.random.randn(10, 5), columns=["col1", "col2", "col3", "col4", "col5"])
df1.to_sql("df1", conn, index=False)
df2.to_sql("df2", conn, index=False)
qry = "SELECT * FROM df1, df2 WHERE df1.col1 > 0 AND df1.col1 < 0.5"
tt = pd.read_sql_query(qry, conn)
You can adapt the query as needed for your application.
I don't know how efficient it is, but someone wrote a wrapper that allows you to use SQL syntax with pandas objects. That's called pandasql. The documentation explicitly states that joins are supported. This might be at least easier to read since SQL syntax is very readable.
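As a rough illustration of that idea, reusing the A/B example frames from the setup above (pandasql is a separate install, pip install pandasql, and this is only a sketch of its sqldf interface):
from pandasql import sqldf

query = """
SELECT *
FROM A, B
WHERE A.A_value BETWEEN B.B_low AND B.B_high
"""
result = sqldf(query, globals())   # sqldf looks up A and B in the given namespace
print(result)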
conditional_join from pyjanitor may be helpful for its abstraction/convenience:
# pip install pyjanitor
import pandas as pd
import janitor
inner join
A.conditional_join(
    B,
    ('A_value', 'B_low', '>='),
    ('A_value', 'B_high', '<=')
)
A_id A_value B_id B_low B_high
0 0 5 0 0 10
1 3 35 1 30 40
2 3 35 2 30 50
3 4 45 2 30 50
left join
A.conditional_join(
    B,
    ('A_value', 'B_low', '>='),
    ('A_value', 'B_high', '<='),
    how='left'
)
A_id A_value B_id B_low B_high
0 0 5 0.0 0.0 10.0
1 1 15 NaN NaN NaN
2 2 25 NaN NaN NaN
3 3 35 1.0 30.0 40.0
4 3 35 2.0 30.0 50.0
5 4 45 2.0 30.0 50.0
6 5 55 NaN NaN NaN
7 6 65 NaN NaN NaN
8 7 75 NaN NaN NaN
9 8 85 NaN NaN NaN
10 9 95 NaN NaN NaN
Let's take a simple example:
df=pd.DataFrame([2,3,4,5,6],columns=['A'])
returns
A
0 2
1 3
2 4
3 5
4 6
Now let's define a second dataframe:
df2=pd.DataFrame([1,6,2,3,5],columns=['B_low'])
df2['B_high']=[2,8,4,6,6]
results in
B_low B_high
0 1 2
1 6 8
2 2 4
3 3 6
4 5 6
Here we go; we want the output to be index 3 with A value 5:
df.where(df['A']>=df2['B_low']).where(df['A']<df2['B_high']).dropna()
results in
A
3 5.0
I know this is an old question, but for newcomers there is now the pandas.merge_asof function, which performs a join based on the closest match.
In case you want to do a merge so that a column of one DataFrame (df_right) is between 2 columns of another DataFrame (df_left), you can do the following:
df_left = pd.DataFrame({
    "time_from": [1, 4, 10, 21],
    "time_to": [3, 7, 15, 27]
})
df_right = pd.DataFrame({
    "time": [2, 6, 16, 25]
})
df_left
time_from time_to
0 1 3
1 4 7
2 10 15
3 21 27
df_right
time
0 2
1 6
2 16
3 25
First, find matches of the right DataFrame that are closest but largest than the left boundary (time_from) of the left DataFrame:
merged = pd.merge_asof(
    left=df_left,
    right=df_right.rename(columns={"time": "candidate_match_1"}),
    left_on="time_from",
    right_on="candidate_match_1",
    direction="forward"
)
merged
time_from time_to candidate_match_1
0 1 3 2
1 4 7 6
2 10 15 16
3 21 27 25
As you can see the candidate match in index 2 is wrongly matched, as 16 is not between 10 and 15.
Then, find matches of the right DataFrame that are closest but smaller than the right boundary (time_to) of the left DataFrame:
merged = pd.merge_asof(
    left=merged,
    right=df_right.rename(columns={"time": "candidate_match_2"}),
    left_on="time_to",
    right_on="candidate_match_2",
    direction="backward"
)
merged
time_from time_to candidate_match_1 candidate_match_2
0 1 3 2 2
1 4 7 6 6
2 10 15 16 6
3 21 27 25 25
Finally, keep the matches where the candidate matches are the same, meaning that the value of the right DataFrame are between values of the 2 columns of the left DataFrame:
merged["match"] = None
merged.loc[merged["candidate_match_1"] == merged["candidate_match_2"], "match"] = \
merged.loc[merged["candidate_match_1"] == merged["candidate_match_2"], "candidate_match_1"]
merged
time_from time_to candidate_match_1 candidate_match_2 match
0 1 3 2 2 2
1 4 7 6 6 6
2 10 15 16 6 None
3 21 27 25 25 25
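For what it's worth, that final masking step can also be written more compactly with Series.where, which keeps the candidate only where the two columns agree (NaN instead of None for the non-matches):
# Equivalent final step: keep candidate_match_1 only where both candidates agree.
merged["match"] = merged["candidate_match_1"].where(
    merged["candidate_match_1"] == merged["candidate_match_2"]
)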

Pandas dataframe groupby using cyclical data

I have some pricing data that looks like this:
import pandas as pd

df = pd.DataFrame([['A', '1', '2015-02-01', 20.00, 20.00, 5],
                   ['A', '1', '2015-02-06', 16.00, 20.00, 8],
                   ['A', '1', '2015-02-14', 14.00, 20.00, 34],
                   ['A', '1', '2015-03-20', 20.00, 20.00, 5],
                   ['A', '1', '2015-03-25', 15.00, 20.00, 15],
                   ['A', '2', '2015-02-01', 75.99, 100.00, 22],
                   ['A', '2', '2015-02-23', 100.00, 100.00, 30],
                   ['A', '2', '2015-03-25', 65.00, 100.00, 64],
                   ['B', '3', '2015-04-01', 45.00, 45.00, 15],
                   ['B', '3', '2015-04-16', 40.00, 45.00, 2],
                   ['B', '3', '2015-04-18', 45.00, 45.00, 30],
                   ['B', '4', '2015-07-25', 5.00, 10.00, 55]],
                  columns=['dept', 'sku', 'date', 'price', 'orig_price', 'days_at_price'])
df['date'] = pd.to_datetime(df['date'])   # dates as timestamps, not bare strings
print(df)
dept sku date price orig_price days_at_price
0 A 1 2015-02-01 20.00 20.00 5
1 A 1 2015-02-06 16.00 20.00 8
2 A 1 2015-02-14 14.00 20.00 34
3 A 1 2015-03-20 20.00 20.00 5
4 A 1 2015-03-25 15.00 20.00 15
5 A 2 2015-02-01 75.99 100.00 22
6 A 2 2015-02-23 100.00 100.00 30
7 A 2 2015-03-25 65.00 100.00 64
8 B 3 2015-04-01 45.00 45.00 15
9 B 3 2015-04-16 40.00 45.00 2
10 B 3 2015-04-18 45.00 45.00 30
11 B 4 2015-07-25 5.00 10.00 55
I want to describe the pricing cycles, which can be defined as the period when a sku goes from original price to promotional price (or multiple promotional prices) and returns to original. A cycle must start with the original price. It is okay to include cycles which never change in price, as well as those that are reduced and never return. But an initial price that is less than orig_price would not be counted as a cycle. For the above df, the result I am looking for is:
dept sku cycle orig_price_days promo_days
0 A 1 1 5 42
1 A 1 2 5 15
2 A 2 1 30 64
3 B 3 1 15 2
4 B 3 2 30 0
I played around with groupby and sum, but can't quite figure out how to define a cycle and total the rows accordingly. Any help would be greatly appreciated.
I got very close to producing the desired end result...
# add a column to track whether price is above/below/equal to orig
df.loc[:,'reg'] = np.sign(df.price - df.orig_price)
# remove row where first known price for sku is promo
df_gp = df.groupby(['dept', 'sku'])
df = df[~((df_gp.cumcount() == 0) & (df.reg == -1))]
# enumerate all the individual pricing cycles
df.loc[:,'cycle'] = (df.reg == 0).cumsum()
# group/aggregate to get days at orig vs. promo pricing
cycles = df.groupby(['dept', 'sku', 'cycle'])['days_at_price'].agg({'promo_days': lambda x: x[1:].sum(), 'reg_days':lambda x: x[:1].sum()})
print(cycles.reset_index())
dept sku cycle reg_days promo_days
0 A 1 1 5 42
1 A 1 2 5 15
2 A 2 3 30 64
3 B 3 4 15 2
4 B 3 5 30 0
The only part that I couldn't quite crack was how to restart the cycle number for each sku before the groupby.
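For reference, one way to get that per-sku restart, building on the reg column and the promo-row filter already created above (a sketch, not necessarily the only option), is a grouped cumulative sum, so that every row at the original price starts a new cycle within its own (dept, sku) group:
# Restart the cycle count for each (dept, sku) group instead of counting globally.
df['cycle'] = (df.reg == 0).astype(int).groupby([df['dept'], df['sku']]).cumsum()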
Try using loc instead of groupby - you want chunks of skus over time periods, not aggregated groups. A for-loop, used in moderation, can also help here and won't be particularly un-pandas-like. (At least if, like me, you consider looping over unique array slices to be fine.)
from datetime import timedelta

df['cycle'] = -1            # create a column for the cycle
skus = df.sku.unique()      # get unique skus for iteration
for sku in skus:
    # Get the start date for each cycle for this sku.
    # NOTE that we define cycles as beginning when the price equals the
    # original price. This avoids the mentioned issue that a cycle should
    # not start if the initial price is less than the original.
    cycle_start_dates = df.loc[(df.sku == sku) &
                               (df.price == df.orig_price),
                               'date'].tolist()
    # Append a terminal date
    cycle_start_dates.append(df.date.max() + timedelta(1))
    # Assign the cycle values
    for i in range(len(cycle_start_dates) - 1):
        df.loc[(df.sku == sku) &
               (cycle_start_dates[i] <= df.date) &
               (df.date < cycle_start_dates[i + 1]), 'cycle'] = i + 1
This should give you a column with all of the cycles for each sku:
dept sku date price orig_price days_at_price cycle
0 A 1 2015-02-01 20.00 20.0 5 1
1 A 1 2015-02-06 16.00 20.0 8 1
2 A 1 2015-02-14 14.00 20.0 34 1
3 A 1 2015-03-20 20.00 20.0 5 2
4 A 1 2015-03-25 15.00 20.0 15 2
5 A 2 2015-02-01 75.99 100.0 22 1
6 A 2 2015-02-23 100.00 100.0 30 1
7 A 2 2015-03-25 65.00 100.0 64 2
8 B 3 2015-04-01 45.00 45.0 15 2
9 B 3 2015-04-16 40.00 45.0 2 2
10 B 3 2015-04-18 45.00 45.0 30 2
11 B 4 2015-07-25 5.00 10.0 55 2
Once you have the cycle column, aggregation becomes relatively straightforward. This multiple aggregation:
df.groupby(['dept', 'sku', 'cycle'])['days_at_price']\
  .agg({'orig_price_days': lambda x: x[:1].sum(),
        'promo_days': lambda x: x[1:].sum()})\
  .reset_index()
will give you the desired result:
dept sku cycle promo_days orig_price_days
0 A 1 1 42 5
1 A 1 2 15 5
2 A 2 -1 0 22
3 A 2 1 64 30
4 B 3 1 2 15
5 B 3 2 0 30
6 B 4 -1 0 55
Note that this has additional -1 values for cycle for pre-cycle, below original pricing.
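If you want to match the desired table exactly, those pre-cycle rows can simply be filtered out afterwards; assuming the aggregation above is assigned to a variable named result (hypothetical name), something like:
# Drop rows that were never part of a cycle (cycle stayed at -1).
result = result[result['cycle'] > 0].reset_index(drop=True)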
