I need to add some values to a dataframe based on the ID and DATE_TWO columns. Whenever DATE_TWO >= DATE_ONE on a row, fill in any subsequent missing DATE_TWO values for that ID with that first DATE_TWO value. Here is the original dataframe:
ID  EVENT  DATE_ONE   DATE_TWO
1   13     3/1/2021
1   20     3/5/2021   3/5/2021
1   32     3/6/2021
1   43     3/7/2021
2   1      3/3/2021
2   2      4/5/2021
3   1      3/1/2021
3   12     3/7/2021   3/7/2021
3   13     3/9/2021
3   15     3/14/2021
Here is the table after the transformation:
ID  EVENT  DATE_ONE   DATE_TWO
1   13     3/1/2021
1   20     3/5/2021   3/5/2021
1   32     3/6/2021   3/5/2021
1   43     3/7/2021   3/5/2021
2   1      3/3/2021
2   2      4/5/2021
3   1      3/1/2021
3   12     3/7/2021   3/7/2021
3   13     3/9/2021   3/7/2021
3   15     3/14/2021  3/7/2021
This could be done with a for loop, but I know that in Python - particularly with dataframes - for loops can be slow. Is there some other, more pythonic and computationally speedy way to accomplish what I am seeking?
data = {'ID': [1,1,1,1,2,2,3,3,3,3],
        'EVENT': [12, 20, 32, 43,1,2,1,12,13,15],
        'DATE_ONE': ['3/1/2021','3/5/2021','3/6/2021','3/7/2021','3/3/2021','4/5/2021',
                     '3/1/2021','3/7/2021','3/9/2021','3/14/2021'],
        'DATE_TWO': ['','3/5/2021','','','','','','3/7/2021','','']}
I slightly changed your data so we can see how it works.
Data
import pandas as pd
import numpy as np
data = {'ID': [1,1,1,1,2,2,3,3,3,3],
        'EVENT': [12, 20, 32, 43,1,2,1,12,13,15],
        'DATE_ONE': ['3/1/2021','3/5/2021','3/6/2021','3/7/2021','3/3/2021','4/5/2021',
                     '3/1/2021','3/7/2021','3/9/2021','3/14/2021'],
        'DATE_TWO': ['','3/5/2021','','','','','3/7/2021','','3/7/2021','']}
df = pd.DataFrame(data)
df["DATE_ONE"] = pd.to_datetime(df["DATE_ONE"])
df["DATE_TWO"] = pd.to_datetime(df["DATE_TWO"])
# We had better sort by DATE_ONE within each ID
df = df.sort_values(["ID", "DATE_ONE"]).reset_index(drop=True)
FILL with condition
df["COND"] = np.where(df["DATE_ONE"].le(df["DATE_TWO"]).eq(True),
1,
np.where(df["DATE_TWO"].notnull() &
df["DATE_ONE"].gt(df["DATE_TWO"]),
0,
np.nan))
grp = df.groupby("ID")
df["COND"] = grp["COND"].fillna(method='ffill').fillna(0)
df["FILL"] = grp["DATE_TWO"].fillna(method='ffill')
df["DATE_TWO"] = np.where(df["COND"].eq(1), df["FILL"], df["DATE_TWO"])
df = df.drop(columns=["COND", "FILL"])
ID EVENT DATE_ONE DATE_TWO
0 1 12 2021-03-01 NaT
1 1 20 2021-03-05 2021-03-05
2 1 32 2021-03-06 2021-03-05
3 1 43 2021-03-07 2021-03-05
4 2 1 2021-03-03 NaT
5 2 2 2021-04-05 NaT
6 3 1 2021-03-01 2021-03-07
7 3 12 2021-03-07 2021-03-07
8 3 13 2021-03-09 2021-03-07
9 3 15 2021-03-14 NaT
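For the rule exactly as stated in the original question (once a row has DATE_TWO >= DATE_ONE, every later row of that ID is filled with that date, with no reset when DATE_ONE later overtakes it), a more compact sketch is possible. This is my suggestion rather than part of the answer above, and it forward-fills the most recent qualifying date:
# Keep only the DATE_TWO values that qualify (comparisons against
# NaT evaluate to False), forward-fill them within each ID, and use
# the result to patch the missing DATE_TWO entries.
qualifying = df["DATE_TWO"].where(df["DATE_TWO"] >= df["DATE_ONE"])
df["DATE_TWO"] = df["DATE_TWO"].fillna(qualifying.groupby(df["ID"]).ffill())
Note that on the modified data above this would keep filling ID 3's last row, whereas the COND approach stops filling once DATE_ONE passes the stored date.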
I have been learning pandas for the past couple of months. I have a data frame like this:
index Random id diff pct
0 2018-01-01 31 1 3 1
1 2018-01-02 11 1 2 2
2 2018-01-03 21 1 4 0
3 2018-01-04 23 2 1 0
4 2018-01-05 43 2 6 3
5 2018-01-06 42 2 1 1
6 2018-01-07 51 3 2 5
7 2018-01-08 47 3 2 0
8 2018-01-09 49 3 3 2
9 2018-01-10 22 3 1 3
What I want is to create a 'recommend' column filled with 'Yes'/'No' by conditioning on the other columns, which I can do, but I also need to update the value of the 'Random' column on each row (or create a new column) whenever 'recommend' is Yes. For instance: if pct < diff, then the 'recommend' column will be 'Yes' and 'Random'/'Random_new' will be Random + diff; otherwise 'recommend' will be 'No' and 'Random'/'Random_new' will keep the value from the previous row. Note that when 'recommend' is Yes on a row, 'Random'/'Random_new' has to be updated for that row and all later rows of that id. The expected output should look like this:
index Random id diff pct recommend Random_new
0 2018-01-01 31 1 3 1 Y 32
1 2018-01-02 31 1 2 2 N 32
2 2018-01-03 31 1 4 0 Y 36
3 2018-01-04 23 2 1 0 Y 24
4 2018-01-05 23 2 6 3 Y 27
5 2018-01-06 23 2 1 1 N 27
6 2018-01-07 51 3 2 5 N 51
7 2018-01-08 51 3 2 0 Y 53
8 2018-01-09 51 3 3 2 Y 56
9 2018-01-10 51 3 1 3 N 56
I have tried np.where, which only creates the column but doesn't update the row values for 'Random_new'. I feel like I need a for loop with if/else conditions, but I could not get it to work so far.
The conditions as bullet points:
If pct < diff, 'Random_new'[i] = 'Random'[i] + 'diff'[i]
else 'Random_new'[i] = 'Random_new'[i-1]
Updating a row also updates the later rows of 'Random_new'
This needs to be done for each id separately (probably using groupby)
First, I'm not sure how you filled in the values in your example: shouldn't the first Random_new be equal to 31+3=34 instead of 32?
Anyway, you can first create your recommend column (a boolean seems better adapted than Y/N), then create Random_new with apply (only when recommend is True), and finally forward-fill (ffill) the values grouped by id:
df['recommend'] = df['pct'] < df['diff']
df['Random_new'] = df.apply(lambda x: x['Random'] + x['diff'] if x['recommend'] else None, axis=1)
df = df.groupby('id').ffill()
Output:
index Random diff pct recommend Random_new
0 2018-01-01 31 3 1 True 34.0
1 2018-01-02 11 2 2 False 34.0
2 2018-01-03 21 4 0 True 25.0
3 2018-01-04 23 1 0 True 24.0
4 2018-01-05 43 6 3 True 49.0
5 2018-01-06 42 1 1 False 49.0
6 2018-01-07 51 2 5 False NaN
7 2018-01-08 47 2 0 True 49.0
8 2018-01-09 49 3 2 True 52.0
9 2018-01-10 22 1 3 False 52.0
Edit: if you want to keep the id column, replace the last line with:
df = pd.concat([df['id'], df.groupby('id').ffill()], axis=1)
(kwarg as_index=False doesn't help in this case)
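The row-wise apply can also be replaced with a vectorized where; a sketch that should reproduce the same Random_new values (my rephrasing, not the answer's own code):
df['recommend'] = df['pct'] < df['diff']
# Random + diff on recommended rows, NaN elsewhere, then
# forward-fill within each id (this also keeps the id column).
df['Random_new'] = (df['Random'] + df['diff']).where(df['recommend'])
df['Random_new'] = df.groupby('id')['Random_new'].ffill()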
Say I have a vector valsHR which looks like this:
valsHR=[78.8, 82.3, 91.0]
And I have a dataframe mainData
Age Patient HR
21 1 NaN
21 1 NaN
21 1 NaN
30 2 NaN
30 2 NaN
24 3 NaN
24 3 NaN
24 3 NaN
I want to fill the NaNs so that the first value in valsHR will only fill in the NaNs for patient 1, the second will fill the NaNs for patient 2 and the third will fill in for patient 3.
So far I've tried using this:
mainData['HR'] = mainData['HR'].fillna(valsHR), but it fills all the NaNs with the first value in the vector.
I've also tried to use this:
mainData['HR'] = mainData.groupby('Patient').fillna(valsHR), which fills the NaNs with values that aren't in the valsHR vector at all.
I was wondering if anyone knew a way to do this?
Create a dictionary from the Patient values that have missing HR, map it to the original column, and replace the missing values only:
print (df)
Age Patient HR
0 21 1 NaN
1 21 1 NaN
2 21 1 NaN
3 30 2 100.0 <- value is not replaced
4 30 2 NaN
5 24 3 NaN
6 24 3 NaN
7 24 3 NaN
p = df.loc[df.HR.isna(), 'Patient'].unique()
valsHR = [78.8, 82.3, 91.0]
df['HR'] = df['HR'].fillna(df['Patient'].map(dict(zip(p, valsHR))))
print (df)
Age Patient HR
0 21 1 78.8
1 21 1 78.8
2 21 1 78.8
3 30 2 100.0
4 30 2 82.3
5 24 3 91.0
6 24 3 91.0
7 24 3 91.0
If some groups have no NaNs, the values are assigned in order to only those groups that do:
print (df)
Age Patient HR
0 21 1 NaN
1 21 1 NaN
2 21 1 NaN
3 30 2 100.0 <- group 2 is not replaced
4 30 2 100.0 <- group 2 is not replaced
5 24 3 NaN
6 24 3 NaN
7 24 3 NaN
p = df.loc[df.HR.isna(), 'Patient'].unique()
valsHR = [78.8, 82.3, 91.0]
df['HR'] = df['HR'].fillna(df['Patient'].map(dict(zip(p, valsHR))))
print (df)
Age Patient HR
0 21 1 78.8
1 21 1 78.8
2 21 1 78.8
3 30 2 100.0
4 30 2 100.0
5 24 3 82.3
6 24 3 82.3
7 24 3 82.3
If all of the NaNs should be replaced, it is simply a mapping:
import pandas as pd
from io import StringIO
valsHR=[78.8, 82.3, 91.0]
vals = {i:k for i,k in enumerate(valsHR, 1)}
df = pd.read_csv(StringIO("""Age Patient
21 1
21 1
21 1
30 2
30 2
24 3
24 3
24 3"""), sep="\s+")
df["HR"] = df["Patient"].map(vals)
>>> df
Age Patient HR
0 21 1 78.8
1 21 1 78.8
2 21 1 78.8
3 30 2 82.3
4 30 2 82.3
5 24 3 91.0
6 24 3 91.0
7 24 3 91.0
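If the patient IDs are not the consecutive integers 1, 2, 3 (an assumption beyond this sample), the same mapping can be built from the order in which the IDs first appear:
# Pair each distinct Patient value, in order of appearance,
# with the corresponding entry of valsHR.
vals = dict(zip(df["Patient"].unique(), valsHR))
df["HR"] = df["Patient"].map(vals)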
I have the following DataFrame
draw_date midday_daily evening_daily midday_win_4 evening_win_4
0 2020-10-05 582 577 5490 4958
1 2020-10-06 318 176 2137 9956
which I am trying to convert into the following shape:
draw_date draw_period winning_numbers wn_01 wn_02 wn_03 wn_04 wn_sum
0 2020-10-05 Midday 5 4 9 0 5 4 9 0 18
1 2020-10-05 Evening 4 9 5 8 4 9 5 8 26
2 2020-10-06 Midday 2 1 3 7 2 1 3 7 13
3 2020-10-06 Evening 9 9 5 6 9 9 5 6 29
Here is what I've achieved so far:
import pandas as pd
df = pd.DataFrame.from_dict({'draw_date': {0: ('2020-10-05 00:00:00'), 1: ('2020-10-06 00:00:00')}, 'midday_daily': {0: '582', 1: '318'},
'evening_daily': {0: '577', 1: '176'}, 'midday_win_4': {0: '5490', 1: '2137'}, 'evening_win_4': {0: '4958', 1: '9956'}})
df.drop(df.columns[1:3], axis=1, inplace=True)
df['draw_date'] = pd.to_datetime(df['draw_date'])
print(df)
Output:
draw_date midday_win_4 evening_win_4
0 2020-10-05 5490 4958
1 2020-10-06 2137 9956
A little bit more verbose/descriptive approach:
def split_numbers(df, column, prefix=None):
    split_col = df[column].astype(str).map(list)
    out = pd.DataFrame(split_col.tolist()).astype(int)
    out.columns += 1
    return df.join(out.add_prefix(prefix))
(df.filter(regex=r"(?:draw_date|win)") # Select the draw_date and "win" columns
.rename(columns=lambda col: col.replace("_win_4", "")) # Remove suffix "_win_4"
.melt( # Reshape the data
id_vars="draw_date",
var_name="draw_period",
value_name="winning_numbers")
.pipe(split_numbers, "winning_numbers", prefix="wn_0") # Extract out the winning numbers and assign back to df
.assign( # Create a sum column
wn_sum=lambda df: df.filter(like="wn").sum(axis=1))
.sort_values( # sort by draw_date and draw_period to line up with OP
["draw_date", "draw_period"],
ascending=[True, False])
)
outputs:
draw_date draw_period winning_numbers wn_01 wn_02 wn_03 wn_04 wn_sum
0 2020-10-05 midday 5490 5 4 9 0 18
2 2020-10-05 evening 4958 4 9 5 8 26
1 2020-10-06 midday 2137 2 1 3 7 13
3 2020-10-06 evening 9956 9 9 5 6 29
# An alternative approach: set the index and stack the win columns
stack = df.set_index('draw_date').stack()
# map list to your stacked series and create a new frame
new_df = pd.DataFrame(list(map(list, stack)), index=stack.index)
# sum the rows column-wise
new_df['sum'] = new_df.astype(int).sum(1)
# add the winning numbers back
new_df['winning numbers'] = stack
print(new_df)
0 1 2 3 sum winning numbers
draw_date
2020-10-05 midday_win_4 5 4 9 0 18 5490
evening_win_4 4 9 5 8 26 4958
2020-10-06 midday_win_4 2 1 3 7 13 2137
evening_win_4 9 9 5 6 29 9956
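To get from there to something closer to the OP's target layout (the column names below are my assumption about what is wanted), the stacked frame can be tidied up afterwards:
# Flatten the MultiIndex and rename the split-digit columns;
# 'level_1' is the name pandas gives the unnamed stacked level.
new_df = (new_df.reset_index()
                .rename(columns={'level_1': 'draw_period',
                                 0: 'wn_01', 1: 'wn_02',
                                 2: 'wn_03', 3: 'wn_04',
                                 'sum': 'wn_sum'}))
# 'midday_win_4' / 'evening_win_4' -> 'midday' / 'evening'
new_df['draw_period'] = new_df['draw_period'].str.replace('_win_4', '', regex=False)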
I am working on a data set with the following columns:
order_id
order_item_id
product_mrp
units
sale_date
I want to create a new column which shows how much the mrp changed since the last time this product was sold. Is there a way I can do this with a pandas data frame?
Sorry if this question is very basic but I am pretty new to pandas.
Sample data:
Expected data:
For each row of the data I want to check the amount of price change for the last time the product was sold.
You can do this as follows:
# define a function that applies a rolling-window calculation,
# taking the difference between the current value and the previous one
def calc_mrp(ser):
    # in case you want the relative change, just
    # divide by x[1] or x[0] in the lambda function;
    # raw=True passes a NumPy array, so x[0]/x[1] index by position
    return ser.rolling(window=2).apply(lambda x: x[1] - x[0], raw=True)
# apply this to the grouped 'product_mrp' column
# and store the result in a new column
df['mrp_change']=df.groupby('product_id')['product_mrp'].apply(calc_mrp)
If this is executed on a dataframe like:
Out[398]:
order_id product_id product_mrp units_sold sale_date
0 0 2 647.169280 8 2019-08-23
1 1 0 500.641188 0 2019-08-24
2 2 1 647.789399 15 2019-08-25
3 3 0 381.278167 12 2019-08-26
4 4 2 373.685000 7 2019-08-27
5 5 4 553.472850 2 2019-08-28
6 6 4 634.482718 7 2019-08-29
7 7 3 536.760482 11 2019-08-30
8 8 0 690.242274 6 2019-08-31
9 9 4 500.515521 0 2019-09-01
It yields:
Out[400]:
order_id product_id product_mrp units_sold sale_date mrp_change
0 0 2 647.169280 8 2019-08-23 NaN
1 1 0 500.641188 0 2019-08-24 NaN
2 2 1 647.789399 15 2019-08-25 NaN
3 3 0 381.278167 12 2019-08-26 -119.363022
4 4 2 373.685000 7 2019-08-27 -273.484280
5 5 4 553.472850 2 2019-08-28 NaN
6 6 4 634.482718 7 2019-08-29 81.009868
7 7 3 536.760482 11 2019-08-30 NaN
8 8 0 690.242274 6 2019-08-31 308.964107
9 9 4 500.515521 0 2019-09-01 -133.967197
The NaNs are in the rows for which there is no previous order with the same product_id.
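For what it's worth, the same per-group first difference can be computed more directly with groupby().diff(); a one-line sketch (my suggestion, not the answer's code):
# Difference from the previous row within each product_id;
# the first row of each group is NaN, exactly as above.
df['mrp_change'] = df.groupby('product_id')['product_mrp'].diff()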
I'm using pandas DataFrames to store stock price data; there are 2940 rows in the dataset.
The time series does not contain rows for Saturday and Sunday, so those missing values have to be filled in.
Here is the code I've written, but it is not solving the problem:
import pandas as pd
import numpy as np
import os
os.chdir('C:/Users/Admin/Analytics/stock-prices')
data = pd.read_csv('stock-data.csv')
# PriceDate Column - Does not contain Saturday and Sunday stock entries
data['PriceDate'] = pd.to_datetime(data['PriceDate'], format='%m/%d/%Y')
data = data.sort_values(by=['PriceDate'], ascending=[True])
# Starting date is Aug 25 2004
idx = pd.date_range('08-25-2004',periods=2940,freq='D')
data = data.set_index(idx)
data['newdate']=data.index
newdate=data['newdate'].values # Create a time series column
data = pd.merge(newdate, data, on='PriceDate', how='outer')
How to fill the missing values for Saturday and Sunday?
I think you can use resample with ffill or bfill, but first set_index from the PriceDate column:
print (data)
ID PriceDate OpenPrice HighPrice
0 1 6/24/2016 1 2
1 2 6/23/2016 3 4
2 2 6/22/2016 5 6
3 2 6/21/2016 7 8
4 2 6/20/2016 9 10
5 2 6/17/2016 11 12
6 2 6/16/2016 13 14
data['PriceDate'] = pd.to_datetime(data['PriceDate'], format='%m/%d/%Y')
data = data.sort_values(by=['PriceDate'], ascending=[True])
data.set_index('PriceDate', inplace=True)
print (data)
ID OpenPrice HighPrice
PriceDate
2016-06-16 2 13 14
2016-06-17 2 11 12
2016-06-20 2 9 10
2016-06-21 2 7 8
2016-06-22 2 5 6
2016-06-23 2 3 4
2016-06-24 1 1 2
data = data.resample('D').ffill().reset_index()
print (data)
PriceDate ID OpenPrice HighPrice
0 2016-06-16 2 13 14
1 2016-06-17 2 11 12
2 2016-06-18 2 11 12
3 2016-06-19 2 11 12
4 2016-06-20 2 9 10
5 2016-06-21 2 7 8
6 2016-06-22 2 5 6
7 2016-06-23 2 3 4
8 2016-06-24 1 1 2
Or, with bfill instead (applied to the same date-indexed frame):
data = data.resample('D').bfill().reset_index()
print (data)
PriceDate ID OpenPrice HighPrice
0 2016-06-16 2 13 14
1 2016-06-17 2 11 12
2 2016-06-18 2 9 10
3 2016-06-19 2 9 10
4 2016-06-20 2 9 10
5 2016-06-21 2 7 8
6 2016-06-22 2 5 6
7 2016-06-23 2 3 4
8 2016-06-24 1 1 2
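If several IDs could share overlapping date ranges (an assumption beyond this sample, where the two IDs' dates do not overlap), the resampling can be done per ID instead; a sketch, starting again from the sorted frame with PriceDate as a column:
# Resample each ID's date range separately, forward-filling within
# the group, then restore ID and PriceDate as regular columns.
data = (data.set_index('PriceDate')
            .groupby('ID')[['OpenPrice', 'HighPrice']]
            .resample('D')
            .ffill()
            .reset_index())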