Fill the missing date values in a Pandas Dataframe column - python

I'm using Pandas to store stock price data in a DataFrame. There are 2940 rows in the dataset.
The time series does not contain values for Saturdays and Sundays, so those missing dates have to be filled in.
Here is the code I've written, but it does not solve the problem:
import pandas as pd
import numpy as np
import os
os.chdir('C:/Users/Admin/Analytics/stock-prices')
data = pd.read_csv('stock-data.csv')
# PriceDate Column - Does not contain Saturday and Sunday stock entries
data['PriceDate'] = pd.to_datetime(data['PriceDate'], format='%m/%d/%Y')
data = data.sort_index(by=['PriceDate'], ascending=[True])
# Starting date is Aug 25 2004
idx = pd.date_range('08-25-2004',periods=2940,freq='D')
data = data.set_index(idx)
data['newdate']=data.index
newdate=data['newdate'].values # Create a time series column
data = pd.merge(newdate, data, on='PriceDate', how='outer')
How to fill the missing values for Saturday and Sunday?

I think you can use resample with ffill or bfill, but first set the index from the PriceDate column:
print (data)
ID PriceDate OpenPrice HighPrice
0 1 6/24/2016 1 2
1 2 6/23/2016 3 4
2 2 6/22/2016 5 6
3 2 6/21/2016 7 8
4 2 6/20/2016 9 10
5 2 6/17/2016 11 12
6 2 6/16/2016 13 14
data['PriceDate'] = pd.to_datetime(data['PriceDate'], format='%m/%d/%Y')
data = data.sort_values(by=['PriceDate'], ascending=[True])
data.set_index('PriceDate', inplace=True)
print (data)
ID OpenPrice HighPrice
PriceDate
2016-06-16 2 13 14
2016-06-17 2 11 12
2016-06-20 2 9 10
2016-06-21 2 7 8
2016-06-22 2 5 6
2016-06-23 2 3 4
2016-06-24 1 1 2
data = data.resample('D').ffill().reset_index()
print (data)
PriceDate ID OpenPrice HighPrice
0 2016-06-16 2 13 14
1 2016-06-17 2 11 12
2 2016-06-18 2 11 12
3 2016-06-19 2 11 12
4 2016-06-20 2 9 10
5 2016-06-21 2 7 8
6 2016-06-22 2 5 6
7 2016-06-23 2 3 4
8 2016-06-24 1 1 2
Alternatively, resample with a backward fill (again starting from the frame indexed by PriceDate):
data = data.resample('D').bfill().reset_index()
print (data)
PriceDate ID OpenPrice HighPrice
0 2016-06-16 2 13 14
1 2016-06-17 2 11 12
2 2016-06-18 2 9 10
3 2016-06-19 2 9 10
4 2016-06-20 2 9 10
5 2016-06-21 2 7 8
6 2016-06-22 2 5 6
7 2016-06-23 2 3 4
8 2016-06-24 1 1 2
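Applied to the original stock data, the whole pipeline would look roughly like this (a sketch assuming the question's 'stock-data.csv' file and PriceDate column, not a tested solution):
import pandas as pd

data = pd.read_csv('stock-data.csv')
data['PriceDate'] = pd.to_datetime(data['PriceDate'], format='%m/%d/%Y')
data = (data.sort_values('PriceDate')
            .set_index('PriceDate')
            .resample('D')    # inserts the missing Saturday/Sunday rows
            .ffill()          # carries Friday's prices forward over the weekend
            .reset_index())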

Related

Python pandas: set consecutive index (starting at 0) for every group in groupby

I have a dataframe sorted and grouped by serial number, then date:
df = df.sort_values(by=["serial_num" , "date"])
df = df.groupby(df['serial_num'])['date'].some_function()
Dataframe
serial_num date
0 1 2001-01-01
1 1 2001-02-01
2 1 2001-03-01
3 1 2001-04-01
4 3 2003-05-01
5 3 2003-06-01
6 3 2003-07-01
7 7 2005-07-01
8 7 2005-08-01
9 7 2005-09-01
10 7 2005-10-01
11 7 2005-11-01
12 7 2005-12-01
Each unique serial_num group will be a line on a line graph.
The way it is graphed now, each line is a time series that starts at a different point -- because the first date is different for every serial_num group.
I need the x-axis of the graph to be time instead of date. All lines on the graph will start at the same point - the origin of the x-axis of my graph.
I think the easiest way to do this would be to add a consecutive index that starts at 0 for each group, like this:
serial_num date new_index
0 1 2001-01-01 0
1 1 2001-02-01 1
2 1 2001-03-01 2
3 1 2001-04-01 3
4 3 2003-05-01 0
5 3 2003-06-01 1
6 3 2003-07-01 2
7 7 2005-07-01 0
8 7 2005-08-01 1
9 7 2005-09-01 2
10 7 2005-10-01 3
11 7 2005-11-01 4
12 7 2005-12-01 5
Then, I think I will be able to graph (in Plotly) with all lines starting at the same point (the 0 index will be the first data point for each serial_num).
NOTE: each serial_num group has a different number of data points.
I'm unsure how to index with groupby this way. Please help! Or if you know another method that will accomplish the same goal, please share. Thanks!
Use cumcount:
df["new_index"] = df.groupby("serial_num").cumcount()
print(df)
==>
serial_num date new_index
0 1 2001-01-01 0
1 1 2001-02-01 1
2 1 2001-03-01 2
3 1 2001-04-01 3
4 3 2003-05-01 0
5 3 2003-06-01 1
6 3 2003-07-01 2
7 7 2005-07-01 0
8 7 2005-08-01 1
9 7 2005-09-01 2
10 7 2005-10-01 3
11 7 2005-11-01 4
12 7 2005-12-01 5
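With new_index in place, plotting is straightforward. A hypothetical sketch, done here with matplotlib rather than Plotly, and assuming a 'value' column to plot on the y-axis (the question's frame doesn't show one):
import matplotlib.pyplot as plt

for serial, grp in df.groupby("serial_num"):
    plt.plot(grp["new_index"], grp["value"], label=f"serial {serial}")
plt.xlabel("periods since first observation")  # all lines now start at x=0
plt.legend()
plt.show()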

Select dataset records by dates from the last 7 weekdays

Given the dataset below, I want to filter all records whose dates fall within the last 7 weekdays.
record_id,date,site,sick,funny,happy
CDEC1947-6,9/2/2018,2,1,1,1
IJKC1953-4,9/29/2018,2,1,1,1
FGHC1724-9,10/25/2018,2,3,1,1
FGHC2929-1,10/31/2018,4,1,1,1
CDEC1912-0,11/1/2018,1,1,1,1
IJKC1726-4,11/2/2018,1,3,1,1
IJKC1728-0,10/26/2018,2,3,1,1
ABCC1730-6,11/2/2018,2,3,1,1
ABCC1731-4,11/2/2018,2,3,1,1
CDEC1733-0,10/22/2018,1,3,1,1
CDEC1735-5,11/2/2018,2,3,1,1
IJKC1914-6,10/27/2018,2,6,1,1
ABCC1916-1,10/23/2018,2,6,1,1
IJKC1918-7,11/2/2018,2,1,1,1
CDEC1920-3,10/24/2018,1,6,1,1
IJKC1943-5,11/2/2018,2,4,1,1
ABCC1945-0,11/2/2018,1,4,1,1
ABCC1949-2,10/25/2018,2,4,1,1
CDEC1951-8,11/2/2018,2,5,1,1
CDEC2924-2,11/3/2018,4,1,1,1
CDEC2927-5,11/3/2018,1,1,1,1
ABCC2925-9,11/4/2018,4,1,1,1
IJKC1941-9,11/4/2018,2,4,1,1
ABCC2922-6,11/5/2018,1,1,1,1
I have tried many tricks without success.
One of them is below:
df['data_recrutamento'] = pd.to_datetime(df['data_recrutamento'])
m1 = (df['sick'] == 1) | (df['funny'] == 1) | (df['happy'] == 1)
m2 = df['date'] >= pd.Timestamp('today') - pd.DateOffset(days=7)
m3 = ~df['date'].dt.weekday.isin([5, 6])
dates_last7_weekdays = df.loc[m1 & m2 & m3, 'site'].value_counts()
dates_last7_weekdays
Another attempt:
import pandas as pd
import numpy as np
from plotly.offline import init_notebook_mode, iplot
from plotly.graph_objs import *
import plotly.graph_objs as go
import datetime
from datetime import date
from datetime import timedelta
today = date.today()
from IPython.core.interactiveshell import InteractiveShell
%matplotlib inline
df=pd.read_csv("dataset.csv", encoding="utf-8",low_memory=False)
df["date"]=pd.to_datetime(df["date"])
df["site"]=df["site"].astype("category") # Convert to category
df['sick']=df['sick'].astype('category')
df["funny"]=df["funny"].astype("category")
df["happy"]=df["happy"].astype("category")
df = df.sort_values(by='date', ascending='True')
df.head()
record_id date site sick funny happy
0 CDEC1947-6 2018-09-02 2 1 1 1
1 IJKC1953-4 2018-09-29 2 1 1 1
9 CDEC1733-0 2018-10-22 1 3 1 1
12 ABCC1916-1 2018-10-23 2 6 1 1
14 CDEC1920-3 2018-10-24 1 6 1 1
2 FGHC1724-9 2018-10-25 2 3 1 1
17 ABCC1949-2 2018-10-25 2 4 1 1
6 IJKC1728-0 2018-10-26 2 3 1 1
11 IJKC1914-6 2018-10-27 2 6 1 1
3 FGHC2929-1 2018-10-31 4 1 1 1
4 CDEC1912-0 2018-11-01 1 1 1 1
7 ABCC1730-6 2018-11-02 2 3 1 1
10 CDEC1735-5 2018-11-02 2 3 1 1
5 IJKC1726-4 2018-11-02 1 3 1 1
13 IJKC1918-7 2018-11-02 2 1 1 1
15 IJKC1943-5 2018-11-02 2 4 1 1
16 ABCC1945-0 2018-11-02 1 4 1 1
18 CDEC1951-8 2018-11-02 2 5 1 1
8 ABCC1731-4 2018-11-02 2 3 1 1
19 CDEC2924-2 2018-11-03 4 1 1 1
20 CDEC2927-5 2018-11-03 1 1 1 1
22 IJKC1941-9 2018-11-04 2 4 1 1
21 ABCC2925-9 2018-11-04 4 1 1 1
23 ABCC2922-6 2018-11-05 1 1 1 1
days_diff = []
for i in df.loc[:, 'date']:
    days_diff.append((datetime.datetime.today() - i).days)
final = df[(pd.Series(days_diff) <= 7) & ((df.loc[:, 'sick'] == 1) | (df.loc[:, 'funny'] == 1) | (df.loc[:, 'happy'] == 1))]
C:\Users\H\Miniconda3\lib\site-packages\ipykernel_launcher.py:10: UserWarning:
Boolean Series key will be reindexed to match DataFrame index.
len(final)
21
final
record_id date site sick funny happy
9 CDEC1733-0 2018-10-22 1 3 1 1
12 ABCC1916-1 2018-10-23 2 6 1 1
14 CDEC1920-3 2018-10-24 1 6 1 1
17 ABCC1949-2 2018-10-25 2 4 1 1
11 IJKC1914-6 2018-10-27 2 6 1 1
10 CDEC1735-5 2018-11-02 2 3 1 1
13 IJKC1918-7 2018-11-02 2 1 1 1
15 IJKC1943-5 2018-11-02 2 4 1 1
16 ABCC1945-0 2018-11-02 1 4 1 1
18 CDEC1951-8 2018-11-02 2 5 1 1
19 CDEC2924-2 2018-11-03 4 1 1 1
20 CDEC2927-5 2018-11-03 1 1 1 1
22 IJKC1941-9 2018-11-04 2 4 1 1
21 ABCC2925-9 2018-11-04 4 1 1 1
23 ABCC2922-6 2018-11-05 1 1 1 1
But my desired result should contain at most 7 distinct dates, because I only want the last 7 weekdays, using today's date as the reference. So, according to the dataset, my output should not include 2018-11-04 and 2018-11-03 (they fall on weekends), and 2018-10-22, 2018-10-23, 2018-10-24, 2018-10-25 and 2018-10-27 should not be included either, as they are not part of the last 7 weekdays. So my final output should only be:
record_id date site sick funny happy
10 CDEC1735-5 2018-11-02 2 3 1 1
13 IJKC1918-7 2018-11-02 2 1 1 1
15 IJKC1943-5 2018-11-02 2 4 1 1
16 ABCC1945-0 2018-11-02 1 4 1 1
18 CDEC1951-8 2018-11-02 2 5 1 1
23 ABCC2922-6 2018-11-05 1 1 1 1
Because those dates belong to the interval corresponding to the last 7 weekdays, from 2018-10-29 to 2018-11-06 (the reference is today as I write this, 2018-11-06, but tomorrow it should be 2018-11-07).
A straightforward way is to subtract dates to find the day differences and use the result for subsetting. We use datetime.datetime.today() to get today's datetime, and subtract each entry of the df.loc[:, 'date'] column from it. To make sure we get whole days rather than a time delta, we take (...).days at the end. We then use a less-than-or-equal comparison to create a boolean series indicating which entries are at most 7 days old. Finally, we use that boolean series to filter the data frame.
import datetime

days_diff = []
for i in df.loc[:, 'date']:
    days_diff.append((datetime.datetime.today() - i).days)

# subset your data frame
df[pd.Series(days_diff) <= 7]

# or to include the other conditions as well:
df[(pd.Series(days_diff) <= 7) & ((df.loc[:, 'sick'] == 1) | (df.loc[:, 'funny'] == 1) | (df.loc[:, 'happy'] == 1))]
NOTE: Convert your date column to proper datetime first
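The above keeps weekend dates, though. A minimal sketch for restricting to the last 7 weekdays specifically (my own variant, not from the answer, assuming df['date'] is already datetime):
import pandas as pd

today = pd.Timestamp.today().normalize()
window = pd.bdate_range(end=today, periods=7)  # last 7 business days, incl. today
last7_weekdays = df[df['date'].isin(window)]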

Compute a ratio conditional on the value in the column of a pandas dataframe

I have a dataframe of the following type
df = pd.DataFrame({'Days':[1,2,5,6,7,10,11,12],
'Value':[100.3,150.5,237.0,314.15,188.0,413.0,158.2,268.0]})
Days Value
0 1 100.3
1 2 150.5
2 5 237.0
3 6 314.15
4 7 188.0
5 10 413.0
6 11 158.2
7 12 268.0
and I would like to add a column '+5Ratio' whose value is the ratio between the Value corresponding to Days+5 and the Value at Days.
For example, in the first row I would have 3.13210368893 = 314.15/100.3, in the second 1.24916943522 = 188.0/150.5, and so on.
Days Value +5Ratio
0 1 100.3 3.13210368893
1 2 150.5 1.24916943522
2 5 237.0 ...
3 6 314.15
4 7 188.0
5 10 413.0
6 11 158.2
7 12 268.0
I'm struggling to find a way to do it using a lambda function.
Could someone help me find a way to solve this problem?
Thanks in advance.
Edit
In the case I am interested in, the "Days" field can vary sparsely, from 1 to 18180 for instance.
You can use merge; a benefit of doing it this way is that it handles missing values:
s=df.merge(df.assign(Days=df.Days-5),on='Days')
s.assign(Value=s.Value_y/s.Value_x).drop(['Value_x','Value_y'],axis=1)
Out[359]:
Days Value
0 1 3.132104
1 2 1.249169
2 5 1.742616
3 6 0.503581
4 7 1.425532
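An equivalent sketch using a Days-to-Value lookup instead of merge (my own variant, not from the answer); reindexing at Days+5 yields NaN wherever no matching day exists:
lookup = df.set_index('Days')['Value']
df['+5Ratio'] = lookup.reindex(df['Days'] + 5).to_numpy() / df['Value'].to_numpy()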
Consider left merging on a helper dataframe of consecutive daily points; because that frame has exactly one row per day, shifting by 5 rows lines each Value up with the Value 5 days later for the ratio calculation. Finally, remove the blank day rows:
days_df = pd.DataFrame({'Days':range(min(df.Days), max(df.Days)+1)})
days_df = days_df.merge(df, on='Days', how='left')
print(days_df)
# Days Value
# 0 1 100.30
# 1 2 150.50
# 2 3 NaN
# 3 4 NaN
# 4 5 237.00
# 5 6 314.15
# 6 7 188.00
# 7 8 NaN
# 8 9 NaN
# 9 10 413.00
# 10 11 158.20
# 11 12 268.00
days_df['+5ratio'] = days_df.shift(-5)['Value'] / days_df['Value']
final_df = days_df[days_df['Value'].notnull()].reset_index(drop=True)
print(final_df)
# Days Value +5ratio
# 0 1 100.30 3.132104
# 1 2 150.50 1.249169
# 2 5 237.00 1.742616
# 3 6 314.15 0.503581
# 4 7 188.00 1.425532
# 5 10 413.00 NaN
# 6 11 158.20 NaN
# 7 12 268.00 NaN
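The same idea also works with reindex instead of a merge, since a consecutive integer index makes a 5-row positional shift equal a 5-day shift (again a sketch of my own, not from the answer):
s = df.set_index('Days')['Value'].reindex(range(df.Days.min(), df.Days.max() + 1))
ratio = (s.shift(-5) / s).dropna()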

Pandas groupby() transform() max() with filter

I have a dataframe like this:
id date value
1 12/01/2016 5
1 25/02/2016 7
1 10/03/2017 13
2 02/04/2016 0
2 06/07/2016 1
2 12/03/2017 6
I'm looking to create a column called 'max_ever' for each unique value of 'id'
I can do: df['max_ever']=df.groupby(['id'])['value'].transform(max)
Which would give me:
id date value max_ever
1 12/01/2016 5 13
1 25/02/2016 7 13
1 10/03/2017 13 13
2 02/04/2016 0 6
2 06/07/2016 1 6
2 12/03/2017 6 6
But I would also like to add a column called 'max_12_months', the maximum over the 12 months before today, for each unique value of 'id'.
I can create a new dataframe with the filtered date and repeat the above, but I'd like to try filter and transform within this dataframe.
The final dataframe would look like this:
id date value max_ever max_12_months
1 12/01/2016 13 13 7
1 25/05/2016 7 13 7
1 10/03/2017 5 13 7
2 02/04/2016 6 6 2
2 06/07/2016 1 6 2
2 12/03/2017 2 6 2
Appreciate any help!
Use a custom aggregation function with apply, then join:
today = pd.Timestamp.today().floor('D')  # pd.datetime has been removed from newer pandas
year_ago = today - pd.offsets.Day(366)

def max12(df):
    return df.value.loc[df.date.between(year_ago, today)].max()

def aggf(df):
    return pd.Series(
        [df.value.max(), max12(df)],
        ['max_ever', 'max_12_months']
    )

df.join(df.groupby('id').apply(aggf), on='id')
id date value max_ever max_12_months
0 1 2016-01-12 13 13 7
1 1 2016-05-25 7 13 7
2 1 2017-03-10 5 13 7
3 2 2016-04-02 6 6 2
4 2 2016-07-06 1 6 2
5 2 2017-03-12 2 6 2
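A sketch of an alternative that stays within transform, masking values outside the 12-month window first (my own variant, reusing the today/year_ago bounds defined above):
df['max_ever'] = df.groupby('id')['value'].transform('max')
df['max_12_months'] = (df['value']
    .where(df['date'].between(year_ago, today))  # NaN outside the window
    .groupby(df['id'])
    .transform('max'))                           # max skips the NaNs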

Filtering Pandas Dataframe by mean of last N values

I'm trying to get all records where the mean of the last 3 rows is greater than the overall mean for all rows in a filtered set.
_filtered_d_all = _filtered_d.iloc[:, 0:50].loc[:, _filtered_d.mean()>0.05]
_last_n_records = _filtered_d.tail(3)
Something like this
_filtered_growing = _filtered_d.iloc[:, 0:50].loc[:, _last_n_records.mean() > _filtered_d.mean()]
However, the problem here is that the value length is incorrect. Any tips?
ValueError: Series lengths must match to compare
Sample Data
This has an index on the year and month, and 2 columns.
Col1 Col2
year month
2005 12 0.533835 0.170679
12 0.494733 0.198347
2006 3 0.440098 0.202240
6 0.410285 0.188421
9 0.502420 0.200188
12 0.522253 0.118680
2007 3 0.378120 0.171192
6 0.431989 0.145158
9 0.612036 0.178097
12 0.519766 0.252196
2008 3 0.547705 0.202163
6 0.560985 0.238591
9 0.617320 0.199537
12 0.343939 0.253855
Why not just boolean index directly on your filtered DataFrame with
df[df.tail(3).mean() > df.mean()]
Demo
>>> df
0 1 2 3 4
0 4 8 2 4 6
1 0 0 0 2 8
2 5 3 0 9 3
3 7 5 5 1 2
4 9 7 8 9 4
>>> df[df.tail(3).mean() > df.mean()]
0 1 2 3 4
0 4 8 2 4 6
1 0 0 0 2 8
2 5 3 0 9 3
3 7 5 5 1 2
Update: example for the MultiIndex edit
The same should work fine for your MultiIndex sample; we just have to mask a bit differently, of course.
>>> df
col1 col2
2005 12 -0.340088 -0.574140
12 -0.814014 0.430580
2006 3 0.464008 0.438494
6 0.019508 -0.635128
9 0.622645 -0.824526
12 -1.674920 -1.027275
2007 3 0.397133 0.659467
6 0.026170 -0.052063
9 0.835561 0.608067
12 0.736873 -0.613877
2008 3 0.344781 -0.566392
6 -0.653290 -0.264992
9 0.080592 -0.548189
12 0.585642 1.149779
>>> df.loc[:,df.tail(3).mean() > df.mean()]
col2
2005 12 -0.574140
12 0.430580
2006 3 0.438494
6 -0.635128
9 -0.824526
12 -1.027275
2007 3 0.659467
6 -0.052063
9 0.608067
12 -0.613877
2008 3 -0.566392
6 -0.264992
9 -0.548189
12 1.149779
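A self-contained sketch of the column-mask idea on synthetic data (my own construction, not from the answer):
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((14, 2)), columns=['Col1', 'Col2'])

mask = df.tail(3).mean() > df.mean()  # one boolean per column
growing = df.loc[:, mask]             # keep only the "growing" columns
print(growing.columns.tolist())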
