This question is not the same as "pandas every nth row" or "every n rows", so please don't delete it.
Following are some rows of my table:
open high low close volume datetime
277.14 277.51 276.71 276.8799 968908 2020-04-13 08:31:00.000
245.3 246.06 245.2 246.01 1094537 2020-04-13 08:32:00.000
285.12 285.27 284.81 285.22 534427 2020-04-13 08:33:00.000
246.08 246.08 245.27 245.46 1333257 2020-04-13 08:34:00.000
291.71 291.73 291.08 291.28 1439183 2020-04-13 08:35:00.000
245.89 246.63 245.64 246.25 960411 2020-04-13 08:36:00.000
285.18 285.4 285 285.36 188531 2020-04-13 08:30:37.000
285.79 285.79 285.65 285.68 6251 2020-04-13 08:38:00.000
246.25 246.56 246.12 246.515 956339 2020-04-13 08:39:00.000
I want to get every 3 rows as a sliding window. For example:
the 1st time get the 1st, 2nd, 3rd rows,
the 2nd time get the 2nd, 3rd, 4th rows,
the 3rd time get the 3rd, 4th, 5th rows,
the 4th time get the 4th, 5th, 6th rows.
Is there a good way to do this with pandas or Python? Thanks.
Use a generator with iloc to select the desired rows:
def rows_generator(df):
    i = 0
    while (i + 3) <= df.shape[0]:
        yield df.iloc[i:(i + 3):1, :]
        i += 1
i = 1
for df in rows_generator(df):
    print(f'Time #{i}')
    print(df)
    i += 1
Example output:
Time #1
Group Cat Value
0 Group1 Cat1 1230
1 Group2 Cat2 4019
2 Group3 Cat3 9491
Time #2
Group Cat Value
1 Group2 Cat2 4019
2 Group3 Cat3 9491
3 Group4 Cat4 9588
Time #3
Group Cat Value
2 Group3 Cat3 9491
3 Group4 Cat4 9588
4 Group5 Cat5 6402
Time #4
Group Cat Value
3 Group4 Cat4 9588
4 Group5 Cat5 6402
5 Group6 Cat 1923
Time #5
Group Cat Value
4 Group5 Cat5 6402
5 Group6 Cat 1923
6 Group7 Cat7 492
Time #6
Group Cat Value
5 Group6 Cat 1923
6 Group7 Cat7 492
7 Group8 Cat8 8589
Time #7
Group Cat Value
6 Group7 Cat7 492
7 Group8 Cat8 8589
8 Group9 Cat9 8582
Does .shift() do what you want?
import pandas as pd
df = pd.DataFrame({'w': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]})
df['x'] = df['w'].shift( 0)
df['y'] = df['w'].shift(-1)
df['z'] = df['w'].shift(-2)
print(df)
w x y z
0 10 10 20.0 30.0
1 20 20 30.0 40.0
2 30 30 40.0 50.0
3 40 40 50.0 60.0
4 50 50 60.0 70.0
5 60 60 70.0 80.0
6 70 70 80.0 90.0
7 80 80 90.0 100.0
8 90 90 100.0 NaN
9 100 100 NaN NaN
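If you only want the complete 3-value windows, a possible follow-up (my addition, not part of the answer) is to drop the rows where the shifted columns are NaN:
complete = df.dropna(subset=['y', 'z'])  # keeps rows 0-7, each holding one full window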
The following should work:
for i in range(len(df)-2):
    result = df.iloc[i:i+3, :]
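If you would rather collect all of the windows instead of overwriting result on each pass, a small sketch along the same lines (assuming df is the frame from the question):
windows = [df.iloc[i:i + 3] for i in range(len(df) - 2)]
for n, w in enumerate(windows, start=1):
    print(f'Time #{n}')
    print(w)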
How can I find the average value of the n largest values in a month, where each day is only counted once?
I do have a timestamp column as well, but I would guess deriving month and day columns from it is the way to go?
I tried df['peak_avg'] = df.groupby(['month', 'day'])['value'].transform(lambda x: x.nlargest(3).mean()), but this takes the average of the three largest values within each day, rather than across the month.
month  day  value  peak_avg (expected)
1      1    35     35
1      1    30     35
2      1    34     28.5
2      2    23     28.5
3      1    98     97
3      2    96     97
IIUC, you can drop the duplicates in the month and day columns and, at the end, fill the remaining NaNs:
df['peak_avg'] = (df.sort_values(['month', 'day', 'value'], ascending=[True, True, False])
                    .drop_duplicates(['month', 'day'])
                    .groupby(['month'])['value']
                    .transform(lambda x: x.nlargest(3).mean()))
df['peak_avg'] = df.groupby(['month', 'day'])['peak_avg'].apply(lambda g: g.ffill().bfill())
print(df)
month day value peak_avg
0 1 1 35 35.0
1 1 1 12 35.0
2 2 1 34 28.5
3 2 3 23 28.5
4 3 1 98 98.0
5 3 2 98 98.0
You can first derive the max value for each day, and then group only by month, since you want the average per month.
df['max_value'] = df.groupby(['month', 'day'])['value'].transform('max')
df['peak_avg'] = df.groupby('month')['max_value'].transform(lambda x: x.nlargest(3).mean())
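A fuller sketch of that idea (my reading of the intended steps, assuming the df from the question): reduce to one max value per (month, day), average the three largest per month, then map the result back onto every row.
daily_max = df.groupby(['month', 'day'], as_index=False)['value'].max()
monthly_avg = daily_max.groupby('month')['value'].apply(lambda x: x.nlargest(3).mean())
df['peak_avg'] = df['month'].map(monthly_avg)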
Solution
Pivot the dataframe with aggfunc max, then sort, take the top three columns, and use nanmean along the columns axis to calculate the average:
import numpy as np
s = df.pivot_table('value', 'month', 'day', 'max')
s['avg'] = np.nanmean(np.sort(-s, 1)[:, :3] * -1, 1)
df['avg'] = df['month'].map(s['avg'])
month day value peak_avg (expected) avg
0 1 1 35.0 35.0 35.0
1 1 1 30.0 35.0 35.0
2 2 1 34.0 28.5 28.5
3 2 2 23.0 28.5 28.5
4 3 1 98.0 97.0 97.0
5 3 2 96.0 97.0 97.0
Say I have a vector valsHR which looks like this:
valsHR = [78.8, 82.3, 91.0]
And I have a dataframe mainData:
Age Patient HR
21 1 NaN
21 1 NaN
21 1 NaN
30 2 NaN
30 2 NaN
24 3 NaN
24 3 NaN
24 3 NaN
I want to fill the NaNs so that the first value in valsHR will only fill in the NaNs for patient 1, the second will fill the NaNs for patient 2 and the third will fill in for patient 3.
So far I've tried using this:
mainData['HR'] = mainData['HR'].fillna(valsHR), but it fills all the NaNs with the first value in the vector.
I've also tried to use this:
mainData['HR'] = mainData.groupby('Patient').fillna(valsHR), but it fills the NaNs with values that aren't in the valsHR vector at all.
I was wondering if anyone knew a way to do this?
Create a dictionary from the Patient values that still have missing HR, map it to the original column, and replace only the missing values:
print (df)
Age Patient HR
0 21 1 NaN
1 21 1 NaN
2 21 1 NaN
3 30 2 100.0 <- value is not replaced
4 30 2 NaN
5 24 3 NaN
6 24 3 NaN
7 24 3 NaN
p = df.loc[df.HR.isna(), 'Patient'].unique()
valsHR = [78.8, 82.3, 91.0]
df['HR'] = df['HR'].fillna(df['Patient'].map(dict(zip(p, valsHR))))
print (df)
Age Patient HR
0 21 1 78.8
1 21 1 78.8
2 21 1 78.8
3 30 2 100.0
4 30 2 82.3
5 24 3 91.0
6 24 3 91.0
7 24 3 91.0
If some groups have no NaNs:
print (df)
Age Patient HR
0 21 1 NaN
1 21 1 NaN
2 21 1 NaN
3 30 2 100.0 <- group 2 is not replaced
4 30 2 100.0 <- group 2 is not replaced
5 24 3 NaN
6 24 3 NaN
7 24 3 NaN
p = df.loc[df.HR.isna(), 'Patient'].unique()
valsHR = [78.8, 82.3, 91.0]
df['HR'] = df['HR'].fillna(df['Patient'].map(dict(zip(p, valsHR))))
print (df)
Age Patient HR
0 21 1 78.8
1 21 1 78.8
2 21 1 78.8
3 30 2 100.0
4 30 2 100.0
5 24 3 82.3
6 24 3 82.3
7 24 3 82.3
It is simple mapping, if all of the NaNs should be replaced:
import pandas as pd
from io import StringIO
valsHR=[78.8, 82.3, 91.0]
vals = {i:k for i,k in enumerate(valsHR, 1)}
df = pd.read_csv(StringIO("""Age Patient
21 1
21 1
21 1
30 2
30 2
24 3
24 3
24 3"""), sep="\s+")
df["HR"] = df["Patient"].map(vals)
>>> df
Age Patient HR
0 21 1 78.8
1 21 1 78.8
2 21 1 78.8
3 30 2 82.3
4 30 2 82.3
5 24 3 91.0
6 24 3 91.0
7 24 3 91.0
My goal today is to follow each ID that belongs to Category==1 on a given date, one year later. So I have a dataframe like this:
Period ID Amount Category
20130101 1 100 1
20130101 2 150 1
20130101 3 100 1
20130201 1 90 1
20130201 2 140 1
20130201 3 95 1
20130201 5 250 0
. . .
20140101 1 40 1
20140101 2 70 1
20140101 5 160 0
20140201 1 35 1
20140201 2 65 1
20140201 5 150 0
For example, in 20130201 I have 3 IDs that belong to Category 1: 1, 2, 3, but just 2 of them are present in 20140201: 1 and 2. So I need to get the value of Amount, only for those IDs, one year later, like this:
Period ID Amount Category Amount_t1
20130101 1 100 1 40
20130101 2 150 1 70
20130101 3 100 1 nan
20130201 1 90 1 35
20130201 2 140 1 65
20130201 3 95 1 nan
20130201 5 250 0 nan
. . .
20140101 1 40 1 nan
20140101 2 70 1 nan
20140101 5 160 0 nan
20140201 1 35 1 nan
20140201 2 65 1 nan
20140201 5 150 0 nan
So, if the ID doesn't appear the next year or belongs to Category 0, I'll get a NaN. My first approach was to get the list of unique IDs in each Period and then try to map that to the next year, using some combination of groupby() and isin(), like this:
aux = df[df.Category==1].groupby('Period').ID.unique()
aux.index = aux.index + pd.DateOffset(years=1)
But I didn't know how to keep going. I'm thinking some kind of groupby('ID') might be more efficient too. If it were a simple shift() that would be easy, but I'm not sure about how to get the value offset by a year by group.
You can create lagged features with an exact merge after you manually lag one of the join keys.
import pandas as pd
# Datetime so we can do calendar year subtraction
df['Period'] = pd.to_datetime(df.Period, format='%Y%m%d')
# Create one with the lagged features. Here I'll split the steps out.
df2 = df.copy()
df2['Period'] = df2.Period-pd.offsets.DateOffset(years=1) # 1 year lag
df2 = df2.rename(columns={'Amount': 'Amount_t1'})
# Keep only values you want to merge
df2 = df2[df2.Category.eq(1)]
# Bring lagged features
df.merge(df2, on=['Period', 'ID', 'Category'], how='left')
Period ID Amount Category Amount_t1
0 2013-01-01 1 100 1 40.0
1 2013-01-01 2 150 1 70.0
2 2013-01-01 3 100 1 NaN
3 2013-02-01 1 90 1 35.0
4 2013-02-01 2 140 1 65.0
5 2013-02-01 3 95 1 NaN
6 2013-02-01 5 250 0 NaN
7 2014-01-01 1 40 1 NaN
8 2014-01-01 2 70 1 NaN
9 2014-01-01 5 160 0 NaN
10 2014-02-01 1 35 1 NaN
11 2014-02-01 2 65 1 NaN
12 2014-02-01 5 150 0 NaN
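Note that the merge on the last code line above is not assigned to anything; to keep the lagged column you would write something like df = df.merge(df2, on=['Period', 'ID', 'Category'], how='left').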
dataframe = pd.DataFrame(data={'user': [1,1,1,1,1,2,2,2,2,2], 'usage':
[12,18,76,32,43,45,19,42,9,10]})
dataframe['mean'] = dataframe.groupby('user'['usage'].apply(pd.rolling_mean, 2))
Why is this code not working?
I am getting an error that rolling_mean is not found in pandas.
Use groupby with rolling (see the docs). The reset_index(level=0, drop=True) drops the group level that groupby adds, so the result aligns back with the original rows:
dataframe['mean'] = (dataframe.groupby('user')['usage']
.rolling(2)
.mean()
.reset_index(level=0, drop=True))
print (dataframe)
user usage mean
0 1 12 NaN
1 1 18 15.0
2 1 76 47.0
3 1 32 54.0
4 1 43 37.5
5 2 45 NaN
6 2 19 32.0
7 2 42 30.5
8 2 9 25.5
9 2 10 9.5
I have some pricing data that looks like this:
import pandas as pd
df = pd.DataFrame([['A', '1', '2015-02-01', 20.00, 20.00, 5],
                   ['A', '1', '2015-02-06', 16.00, 20.00, 8],
                   ['A', '1', '2015-02-14', 14.00, 20.00, 34],
                   ['A', '1', '2015-03-20', 20.00, 20.00, 5],
                   ['A', '1', '2015-03-25', 15.00, 20.00, 15],
                   ['A', '2', '2015-02-01', 75.99, 100.00, 22],
                   ['A', '2', '2015-02-23', 100.00, 100.00, 30],
                   ['A', '2', '2015-03-25', 65.00, 100.00, 64],
                   ['B', '3', '2015-04-01', 45.00, 45.00, 15],
                   ['B', '3', '2015-04-16', 40.00, 45.00, 2],
                   ['B', '3', '2015-04-18', 45.00, 45.00, 30],
                   ['B', '4', '2015-07-25', 5.00, 10.00, 55]],
                  columns=['dept', 'sku', 'date', 'price', 'orig_price', 'days_at_price'])
print(df)
dept sku date price orig_price days_at_price
0 A 1 2015-02-01 20.00 20.00 5
1 A 1 2015-02-06 16.00 20.00 8
2 A 1 2015-02-14 14.00 20.00 34
3 A 1 2015-03-20 20.00 20.00 5
4 A 1 2015-03-25 15.00 20.00 15
5 A 2 2015-02-01 75.99 100.00 22
6 A 2 2015-02-23 100.00 100.00 30
7 A 2 2015-03-25 65.00 100.00 64
8 B 3 2015-04-01 45.00 45.00 15
9 B 3 2015-04-16 40.00 45.00 2
10 B 3 2015-04-18 45.00 45.00 30
11 B 4 2015-07-25 5.00 10.00 55
I want to describe the pricing cycles, which can be defined as the period when a sku goes from original price to promotional price (or multiple promotional prices) and returns to original. A cycle must start with the original price. It is okay to include cycles which never change in price, as well as those that are reduced and never return. But an initial price that is less than orig_price would not be counted as a cycle. For the above df, the result I am looking for is:
dept sku cycle orig_price_days promo_days
0 A 1 1 5 42
1 A 1 2 5 15
2 A 2 1 30 64
3 B 3 1 15 2
4 B 3 2 30 0
I played around with groupby and sum, but can't quite figure out how to define a cycle and total the rows accordingly. Any help would be greatly appreciated.
I got very close to producing the desired end result...
import numpy as np

# add a column to track whether the price is above/below/equal to orig
df.loc[:, 'reg'] = np.sign(df.price - df.orig_price)
# remove rows where the first known price for a sku is a promo
df_gp = df.groupby(['dept', 'sku'])
df = df[~((df_gp.cumcount() == 0) & (df.reg == -1))]
# enumerate all the individual pricing cycles
df.loc[:, 'cycle'] = (df.reg == 0).cumsum()
# group/aggregate to get days at orig vs. promo pricing
cycles = df.groupby(['dept', 'sku', 'cycle'])['days_at_price'].agg(
    reg_days=lambda x: x[:1].sum(),
    promo_days=lambda x: x[1:].sum())
print(cycles.reset_index())
dept sku cycle reg_days promo_days
0 A 1 1 5 42
1 A 1 2 5 15
2 A 2 3 30 64
3 B 3 4 15 2
4 B 3 5 30 0
The only part that I couldn't quite crack was how to restart the cycle number for each sku before the groupby.
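For what it's worth, one way to restart the count per sku (a sketch that assumes the reg column defined above, where reg == 0 marks a row at the original price):
df['cycle'] = (df.groupby(['dept', 'sku'])['reg']
                 .transform(lambda s: s.eq(0).cumsum()))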
Try using loc instead of groupby - you want chunks of skus over time periods, not aggregated groups. A for-loop, used in moderation, can also help here and won't be particularly un-pandas like. (At least if, like me, you consider looping over unique array slices to be fine.)
from datetime import timedelta

df['date'] = pd.to_datetime(df['date'])  # make sure dates are comparable datetimes
df['cycle'] = -1                         # create a column for the cycle
skus = df.sku.unique()                   # get unique skus for iteration
for sku in skus:
    # Get the start date for each cycle for this sku.
    # NOTE that we define cycles as beginning
    # when the price equals the original price.
    # This avoids the mentioned issue that a cycle should not start
    # if the initial price is less than the original.
    cycle_start_dates = df.loc[(df.sku == sku) &
                               (df.price == df.orig_price),
                               'date'].tolist()
    # append a terminal date
    cycle_start_dates.append(df.date.max() + timedelta(1))
    # Assign the cycle values
    for i in range(len(cycle_start_dates) - 1):
        df.loc[(df.sku == sku) &
               (cycle_start_dates[i] <= df.date) &
               (df.date < cycle_start_dates[i + 1]), 'cycle'] = i + 1
This should give you a column with all of the cycles for each sku:
dept sku date price orig_price days_at_price cycle
0 A 1 2015-02-01 20.00 20.0 5 1
1 A 1 2015-02-06 16.00 20.0 8 1
2 A 1 2015-02-14 14.00 20.0 34 1
3 A 1 2015-03-20 20.00 20.0 5 2
4 A 1 2015-03-25 15.00 20.0 15 2
5 A 2 2015-02-01 75.99 100.0 22 -1
6 A 2 2015-02-23 100.00 100.0 30 1
7 A 2 2015-03-25 65.00 100.0 64 1
8 B 3 2015-04-01 45.00 45.0 15 1
9 B 3 2015-04-16 40.00 45.0 2 1
10 B 3 2015-04-18 45.00 45.0 30 2
11 B 4 2015-07-25 5.00 10.0 55 -1
Once you have the cycle column, aggregation becomes relatively straightforward. This multiple aggregation:
df.groupby(['dept', 'sku', 'cycle'])['days_at_price']\
  .agg(promo_days=lambda x: x[1:].sum(),
       orig_price_days=lambda x: x[:1].sum())\
  .reset_index()
will give you the desired result:
dept sku cycle promo_days orig_price_days
0 A 1 1 42 5
1 A 1 2 15 5
2 A 2 -1 0 22
3 A 2 1 64 30
4 B 3 1 2 15
5 B 3 2 0 30
6 B 4 -1 0 55
Note that this also contains cycle values of -1 for the pre-cycle rows, where the price starts below the original price.
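If you want to drop those pre-cycle rows so the result matches the expected output exactly, one option (assuming you assign the aggregation above to a variable, say cycles) is:
cycles = cycles[cycles['cycle'] > 0].reset_index(drop=True)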