I have a dataframe that has segments of consecutive values appearing in column a (the value in column b does not matter):
import pandas as pd
import numpy as np
np.random.seed(150)
df = pd.DataFrame(data={'a':[1,2,3,4,5,15,16,17,18,203,204,205],'b':np.random.randint(50000,size=(12))})
>>> df
a b
0 1 27066
1 2 28155
2 3 49177
3 4 496
4 5 2354
5 15 23292
6 16 9358
7 17 19036
8 18 29946
9 203 39785
10 204 15843
11 205 21917
I would like to add a column c that restarts a sequential count for each run of consecutive values in column a, as shown below:
a b c
1 27066 1
2 28155 2
3 49177 3
4 496 4
5 2354 5
15 23292 1
16 9358 2
17 19036 3
18 29946 4
203 39785 1
204 15843 2
205 21917 3
How can I do this?
One solution:
df["c"] = (s := df["a"] - np.arange(len(df))).groupby(s).cumcount() + 1
print(df)
Output
a b c
0 1 27066 1
1 2 28155 2
2 3 49177 3
3 4 496 4
4 5 2354 5
5 15 23292 1
6 16 9358 2
7 17 19036 3
8 18 29946 4
9 203 39785 1
10 204 15843 2
11 205 21917 3
The original idea comes from an old recipe in the Python docs for grouping consecutive numbers with itertools.groupby.
The walrus operator (:=, an assignment expression) requires Python 3.8+; on earlier versions you can split it into two statements:
s = df["a"] - np.arange(len(df))
df["c"] = s.groupby(s).cumcount() + 1
print(df)
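To see why grouping by s works, inspect the key itself: subtracting the row position from each value turns every consecutive run into a constant, so each run acts as its own group label (a quick check, using the df built above):
s = df["a"] - np.arange(len(df))
print(s.tolist())
# [1, 1, 1, 1, 1, 10, 10, 10, 10, 194, 194, 194]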
Another approach: flag rows that continue a consecutive run, take the cumulative sum to get a running count, and then subtract the count that earlier groups had already accumulated.
# True where the current value is exactly one more than the previous one
a = df['a'].add(1).shift(1).eq(df['a'])
# Running count of continuation rows, minus the count frozen at each
# group start (carried forward), yields a per-group counter
df['c'] = a.cumsum() - a.cumsum().where(~a).ffill().fillna(0).astype(int) + 1
df
Result:
a b c
0 1 27066 1
1 2 28155 2
2 3 49177 3
3 4 496 4
4 5 2354 5
5 15 23292 1
6 16 9358 2
7 17 19036 3
8 18 29946 4
9 203 39785 1
10 204 15843 2
11 205 21917 3
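A variant of the same idea, shown below as a sketch against the df from this question: start a new group whenever the step from the previous row is not exactly 1, then number the rows within each group.
# diff() is NaN on the first row and equals 1 inside a consecutive run,
# so every non-1 step starts a new group id via cumsum
grp = df['a'].diff().ne(1).cumsum()
df['c'] = df.groupby(grp).cumcount() + 1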
I need to add some values to a dataframe based on the ID and DATE_TWO columns. When DATE_TWO >= DATE_ONE, fill any subsequent DATE_TWO values for that ID with that first DATE_TWO value. Here is the original dataframe:
ID  EVENT  DATE_ONE   DATE_TWO
1   13     3/1/2021
1   20     3/5/2021   3/5/2021
1   32     3/6/2021
1   43     3/7/2021
2   1      3/3/2021
2   2      4/5/2021
3   1      3/1/2021
3   12     3/7/2021   3/7/2021
3   13     3/9/2021
3   15     3/14/2021
Here is the table after the transformation:
ID  EVENT  DATE_ONE   DATE_TWO
1   13     3/1/2021
1   20     3/5/2021   3/5/2021
1   32     3/6/2021   3/5/2021
1   43     3/7/2021   3/5/2021
2   1      3/3/2021
2   2      4/5/2021
3   1      3/1/2021
3   12     3/7/2021   3/7/2021
3   13     3/9/2021   3/7/2021
3   15     3/14/2021  3/7/2021
This could be done with a for loop, but I know that in Python, particularly with dataframes, for loops can be slow. Is there a more Pythonic and computationally speedy way to accomplish this?
data = {'ID': [1,1,1,1,2,2,3,3,3,3],
        'EVENT': [12, 20, 32, 43, 1, 2, 1, 12, 13, 15],
        'DATE_ONE': ['3/1/2021','3/5/2021','3/6/2021','3/7/2021','3/3/2021','4/5/2021',
                     '3/1/2021','3/7/2021','3/9/2021','3/14/2021'],
        'DATE_TWO': ['','3/5/2021','','','','','','3/7/2021','','']}
I slightly changed your data so we can see how it works.
Data
import pandas as pd
import numpy as np
data = {'ID': [1,1,1,1,2,2,3,3,3,3],
        'EVENT': [12, 20, 32, 43, 1, 2, 1, 12, 13, 15],
        'DATE_ONE': ['3/1/2021','3/5/2021','3/6/2021','3/7/2021','3/3/2021','4/5/2021',
                     '3/1/2021','3/7/2021','3/9/2021','3/14/2021'],
        'DATE_TWO': ['','3/5/2021','','','','','3/7/2021','','3/7/2021','']}
df = pd.DataFrame(data)
df["DATE_ONE"] = pd.to_datetime(df["DATE_ONE"])
df["DATE_TWO"] = pd.to_datetime(df["DATE_TWO"])  # empty strings become NaT
# Sort by DATE_ONE within each ID
df = df.sort_values(["ID", "DATE_ONE"]).reset_index(drop=True)
FILL with condition
df["COND"] = np.where(df["DATE_ONE"].le(df["DATE_TWO"]).eq(True),
1,
np.where(df["DATE_TWO"].notnull() &
df["DATE_ONE"].gt(df["DATE_TWO"]),
0,
np.nan))
grp = df.groupby("ID")
df["COND"] = grp["COND"].fillna(method='ffill').fillna(0)
df["FILL"] = grp["DATE_TWO"].fillna(method='ffill')
df["DATE_TWO"] = np.where(df["COND"].eq(1), df["FILL"], df["DATE_TWO"])
df = df.drop(columns=["COND", "FILL"])
ID EVENT DATE_ONE DATE_TWO
0 1 12 2021-03-01 NaT
1 1 20 2021-03-05 2021-03-05
2 1 32 2021-03-06 2021-03-05
3 1 43 2021-03-07 2021-03-05
4 2 1 2021-03-03 NaT
5 2 2 2021-04-05 NaT
6 3 1 2021-03-01 2021-03-07
7 3 12 2021-03-07 2021-03-07
8 3 13 2021-03-09 2021-03-07
9 3 15 2021-03-14 NaT
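For the rule exactly as stated in the question (once a row has DATE_TWO >= DATE_ONE, copy that first DATE_TWO into the ID's later empty rows), a shorter sketch is possible. Run it against a freshly built and sorted df, since the code above has already overwritten DATE_TWO; note it reproduces the question's desired output but not the reset behaviour visible for ID 3 in the modified data above.
# Rows whose DATE_TWO qualifies as an anchor for their ID
anchor = df["DATE_TWO"].ge(df["DATE_ONE"])
# Keep only anchor values and carry them forward within each ID;
# rows before the first anchor stay NaT
fill = df["DATE_TWO"].where(anchor).groupby(df["ID"]).ffill()
df["DATE_TWO"] = df["DATE_TWO"].fillna(fill)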
I have a pandas DataFrame with a two-level MultiIndex, a Date and a Gender level. It looks like this:
Division North South West East
Date Gender
2016-05-16 19:00:00 F 0 2 3 3
M 12 15 12 12
2016-05-16 20:00:00 F 12 9 11 11
M 10 13 8 9
2016-05-16 21:00:00 F 9 4 7 1
M 5 1 12 10
Now if I want to find the average values for each hour, I know I can do something like:
df.groupby(df.index.hour).mean()
but this does not seem to work when you have a MultiIndex. I found that I could reach the Date level like:
df.groupby(df.index.get_level_values('Date').hour).mean()
which averages over the hours of the day, but I lose track of the Gender level...
So my question is: how can I find the average hourly values for each Division by Gender?
I think you can add a level of the MultiIndex to the groupby keys; this needs pandas 0.20.1+:
df1 = df.groupby([df.index.get_level_values('Date').hour,'Gender']).mean()
print (df1)
North South West East
Date Gender
19 F 0 2 3 3
M 12 15 12 12
20 F 12 9 11 11
M 10 13 8 9
21 F 9 4 7 1
M 5 1 12 10
Another solution:
df1 = df.groupby([df.index.get_level_values('Date').hour,
df.index.get_level_values('Gender')]).mean()
print (df1)
North South West East
Date Gender
19 F 0 2 3 3
M 12 15 12 12
20 F 12 9 11 11
M 10 13 8 9
21 F 9 4 7 1
M 5 1 12 10
Or simply turn the MultiIndex levels into regular columns:
df = df.reset_index()
df1 = df.groupby([df['Date'].dt.hour, 'Gender']).mean()
print (df1)
North South West East
Date Gender
19 F 0 2 3 3
M 12 15 12 12
20 F 12 9 11 11
M 10 13 8 9
21 F 9 4 7 1
M 5 1 12 10
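If you want to reproduce these results, a frame matching the question's sample can be built like this (values copied from the printout above):
import pandas as pd
idx = pd.MultiIndex.from_product(
    [pd.to_datetime(['2016-05-16 19:00:00',
                     '2016-05-16 20:00:00',
                     '2016-05-16 21:00:00']),
     ['F', 'M']],
    names=['Date', 'Gender'])
df = pd.DataFrame([[0, 2, 3, 3], [12, 15, 12, 12],
                   [12, 9, 11, 11], [10, 13, 8, 9],
                   [9, 4, 7, 1], [5, 1, 12, 10]],
                  index=idx, columns=['North', 'South', 'West', 'East'])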
I'm trying to get all records where the mean of the last 3 rows is greater than the overall mean for all rows in a filtered set.
_filtered_d_all = _filtered_d.iloc[:, 0:50].loc[:, _filtered_d.mean()>0.05]
_last_n_records = _filtered_d.tail(3)
Something like this
_filtered_growing = _filtered_d.iloc[:, 0:50].loc[:, _last_n_records.mean() > _filtered_d.mean()]
However, the problem here is that the Series lengths do not match, so the comparison fails. Any tips?
ValueError: Series lengths must match to compare
Sample Data
This has an index on the year and month, and 2 columns.
Col1 Col2
year month
2005 12 0.533835 0.170679
12 0.494733 0.198347
2006 3 0.440098 0.202240
6 0.410285 0.188421
9 0.502420 0.200188
12 0.522253 0.118680
2007 3 0.378120 0.171192
6 0.431989 0.145158
9 0.612036 0.178097
12 0.519766 0.252196
2008 3 0.547705 0.202163
6 0.560985 0.238591
9 0.617320 0.199537
12 0.343939 0.253855
Why not just boolean index directly on your filtered DataFrame with
df[df.tail(3).mean() > df.mean()]
Demo
>>> df
0 1 2 3 4
0 4 8 2 4 6
1 0 0 0 2 8
2 5 3 0 9 3
3 7 5 5 1 2
4 9 7 8 9 4
>>> df[df.tail(3).mean() > df.mean()]
0 1 2 3 4
0 4 8 2 4 6
1 0 0 0 2 8
2 5 3 0 9 3
3 7 5 5 1 2
Update: example for the MultiIndex edit
The same works fine for your MultiIndex sample; we just have to mask a bit differently, selecting columns with .loc since the mask has one entry per column.
>>> df
col1 col2
2005 12 -0.340088 -0.574140
12 -0.814014 0.430580
2006 3 0.464008 0.438494
6 0.019508 -0.635128
9 0.622645 -0.824526
12 -1.674920 -1.027275
2007 3 0.397133 0.659467
6 0.026170 -0.052063
9 0.835561 0.608067
12 0.736873 -0.613877
2008 3 0.344781 -0.566392
6 -0.653290 -0.264992
9 0.080592 -0.548189
12 0.585642 1.149779
>>> df.loc[:,df.tail(3).mean() > df.mean()]
col2
2005 12 -0.574140
12 0.430580
2006 3 0.438494
6 -0.635128
9 -0.824526
12 -1.027275
2007 3 0.659467
6 -0.052063
9 0.608067
12 -0.613877
2008 3 -0.566392
6 -0.264992
9 -0.548189
12 1.149779
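To make the mechanics explicit: both df.tail(3).mean() and df.mean() are Series indexed by column name, so comparing them yields one boolean per column, and .loc[:, mask] keeps the matching columns. A self-contained sketch with made-up data (the column names here are just placeholders):
import numpy as np
import pandas as pd
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(10, 3)), columns=['a', 'b', 'c'])
mask = df.tail(3).mean() > df.mean()   # one boolean per column
growing = df.loc[:, mask]              # all rows, only the "growing" columns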
I'm using pandas DataFrames to store stock price data. There are 2940 rows in the dataset.
The time series does not contain values for Saturdays and Sundays, so those missing days have to be filled in.
Here is the code I've written, but it does not solve the problem:
import pandas as pd
import numpy as np
import os
os.chdir('C:/Users/Admin/Analytics/stock-prices')
data = pd.read_csv('stock-data.csv')
# PriceDate Column - Does not contain Saturday and Sunday stock entries
data['PriceDate'] = pd.to_datetime(data['PriceDate'], format='%m/%d/%Y')
data = data.sort_values(by='PriceDate', ascending=True)
# Starting date is Aug 25 2004
idx = pd.date_range('08-25-2004',periods=2940,freq='D')
data = data.set_index(idx)
data['newdate']=data.index
newdate=data['newdate'].values # Create a time series column
data = pd.merge(newdate, data, on='PriceDate', how='outer')
How can I fill in the missing values for Saturdays and Sundays?
I think you can use resample with ffill or bfill, but first set_index to the PriceDate column:
print (data)
ID PriceDate OpenPrice HighPrice
0 1 6/24/2016 1 2
1 2 6/23/2016 3 4
2 2 6/22/2016 5 6
3 2 6/21/2016 7 8
4 2 6/20/2016 9 10
5 2 6/17/2016 11 12
6 2 6/16/2016 13 14
data['PriceDate'] = pd.to_datetime(data['PriceDate'], format='%m/%d/%Y')
data = data.sort_values(by=['PriceDate'], ascending=[True])
data.set_index('PriceDate', inplace=True)
print (data)
ID OpenPrice HighPrice
PriceDate
2016-06-16 2 13 14
2016-06-17 2 11 12
2016-06-20 2 9 10
2016-06-21 2 7 8
2016-06-22 2 5 6
2016-06-23 2 3 4
2016-06-24 1 1 2
data = data.resample('D').ffill().reset_index()   # weekends copy Friday's row forward
print (data)
PriceDate ID OpenPrice HighPrice
0 2016-06-16 2 13 14
1 2016-06-17 2 11 12
2 2016-06-18 2 11 12
3 2016-06-19 2 11 12
4 2016-06-20 2 9 10
5 2016-06-21 2 7 8
6 2016-06-22 2 5 6
7 2016-06-23 2 3 4
8 2016-06-24 1 1 2
Alternatively, backfill (again starting from the frame indexed by PriceDate):
data = data.resample('D').bfill().reset_index()   # weekends copy Monday's row backward
print (data)
PriceDate ID OpenPrice HighPrice
0 2016-06-16 2 13 14
1 2016-06-17 2 11 12
2 2016-06-18 2 9 10
3 2016-06-19 2 9 10
4 2016-06-20 2 9 10
5 2016-06-21 2 7 8
6 2016-06-22 2 5 6
7 2016-06-23 2 3 4
8 2016-06-24 1 1 2
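If repeating Friday's or Monday's prices is not what you want, Resampler.interpolate can estimate the weekend values instead. A sketch, again starting from the frame indexed by PriceDate; note that identifier-like columns such as ID are interpolated too and may need ffill instead:
# Linear estimate across the weekend gap instead of copying a neighbour
data = data.resample('D').interpolate().reset_index()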