How to fill in missing dates and values in a Pandas DataFrame?

So the data set I am using contains only business days, but I want to change the date index so that it reflects every calendar day. I know I have to use reindex(), but I am unsure how to use its fill_value field to make each new row inherit the value above it.
import pandas as pd
idx = pd.date_range("12/18/2019","12/24/2019")
df = pd.Series({'12/18/2019': 22.63,
                '12/19/2019': 22.2,
                '12/20/2019': 21.03,
                '12/23/2019': 17,
                '12/24/2019': 19.65})
df.index = pd.DatetimeIndex(df.index)
df = df.reindex()
Currently, my data set contains only the five business days. When I use reindex() as above nothing changes, and reindexing on idx alone fills the new dates with NaN. In reality I want each NaN to inherit the value directly above it, so the data set covers every calendar day.
Thank you guys for your help!

You were close! You just need to pass the index you want to reindex on (idx in this case) as a parameter to the reindex method, and then you can set the method parameter to 'ffill' to propagate the last valid value forward.
idx = pd.date_range("12/18/2019","12/24/2019")
df = pd.Series({'12/18/2019': 22.63,
                '12/19/2019': 22.2,
                '12/20/2019': 21.03,
                '12/23/2019': 17,
                '12/24/2019': 19.65})
df.index = pd.DatetimeIndex(df.index)
df = df.reindex(idx, method='ffill')
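Putting the whole answer together, a runnable version; the two weekend dates should pick up the last business-day value:

```python
import pandas as pd

idx = pd.date_range("12/18/2019", "12/24/2019")
s = pd.Series({'12/18/2019': 22.63,
               '12/19/2019': 22.2,
               '12/20/2019': 21.03,
               '12/23/2019': 17,
               '12/24/2019': 19.65})
s.index = pd.DatetimeIndex(s.index)
# reindex onto the full calendar range, forward-filling the gaps
s = s.reindex(idx, method='ffill')
print(s)
```

2019-12-21 and 2019-12-22 now carry 21.03, the last Friday close.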

It seems that you have created a Series, not a DataFrame. After reindexing on idx (which introduces the NaN rows), see if the code below helps you.
df = df.to_frame().reset_index() #to convert series to dataframe
df = df.fillna(method='ffill')
print(df)
Output (you will have to rename the columns):
index 0
0 2019-12-18 22.63
1 2019-12-19 22.20
2 2019-12-20 21.03
3 2019-12-21 21.03
4 2019-12-22 21.03
5 2019-12-23 17.00
6 2019-12-24 19.65
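For completeness, this snippet presumably assumes the series was first reindexed onto idx, which is what creates the NaN rows to fill; a runnable sketch of the full flow (using .ffill(), equivalent to fillna(method='ffill')):

```python
import pandas as pd

idx = pd.date_range("12/18/2019", "12/24/2019")
s = pd.Series({'12/18/2019': 22.63, '12/19/2019': 22.2, '12/20/2019': 21.03,
               '12/23/2019': 17, '12/24/2019': 19.65})
s.index = pd.DatetimeIndex(s.index)
# reindex introduces NaN for 12/21 and 12/22, then convert to a DataFrame
df = s.reindex(idx).to_frame().reset_index()
df = df.ffill()  # propagate the last valid close forward
print(df)
```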

Related

Resample a dataframe, interpolate NaNs and return a dataframe

I have a dataframe df that contains data in periods of 3 hours:
index , values
2003-01-01 00:00:00, 2.0
2003-01-01 03:00:00, 1.8
2003-01-01 06:00:00, 1.4
2003-01-01 09:00:00, 1.1
....
I want to resample the data to every hour and interpolate the missing values in between linearly. I can achieve something similar, filling the missing values with .bfill(), and it looks like this:
df2 = df.resample('H').bfill()
I tried to alter this to achieve my task as follows:
df2 = df.resample('H')
df2.interpolate(method='linear', axis=0, inplace=True)
But df2 = df.resample('H') in contrast to df2 = df.resample('H').bfill() doesn't return a dataframe object, but a pandas.core.resample.DatetimeIndexResampler object.
Do you know how I can do the resampling and interpolation? Do you have some other workaround? Thanks!
I found out that I could just append .interpolate() to my initial approach and it would work:
df2 = df.resample('H').interpolate()
The separate interpolate call afterwards is then no longer needed.
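A minimal check of that one-liner on the data from the question (using lowercase 'h', which newer pandas versions prefer over 'H'):

```python
import pandas as pd

df = pd.DataFrame(
    {'values': [2.0, 1.8, 1.4, 1.1]},
    index=pd.to_datetime(['2003-01-01 00:00', '2003-01-01 03:00',
                          '2003-01-01 06:00', '2003-01-01 09:00']))
# upsample to hourly and interpolate linearly between the 3-hour points
df2 = df.resample('h').interpolate()
print(df2)
```

The value at 01:00 comes out one third of the way from 2.0 to 1.8.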

In pandas dataframes, how would you convert all index labels as type DatetimeIndex to datetime.datetime?

Just as the title says, I am trying to convert my DataFrame labels to type datetime. In the following attempted solution I pulled the labels from the DataFrame into dates_index and tried converting them to datetime using the function DatetimeIndex.to_datetime; however, the interpreter says that DatetimeIndex has no attribute to_datetime.
dates_index = df.index[0::]
dates = DatetimeIndex.to_datetime(dates_index)
I've also tried using the pandas.to_datetime function.
dates = pandas.to_datetime(dates_index, errors='coerce')
This returns the datetime wrapped in DatetimeIndex instead of just datetimes.
My DatetimeIndex labels contain data for date and time, and my goal is to push that data into two separate columns of the DataFrame.
If your DatetimeIndex is myindex, then
df.reset_index() will create a myindex column, which you can do what you want with. If you want to make it an index again later, you can revert with df.set_index('myindex').
You can set the index after converting the datatype of the column.
To convert datatype to datetime, use: to_datetime
And, to set the column as index use: set_index
Hope this helps!
import pandas as pd
df = pd.DataFrame({
'mydatecol': ['06/11/2020', '06/12/2020', '06/13/2020', '06/14/2020'],
'othcol1': [10, 20, 30, 40],
'othcol2': [1, 2, 3, 4]
})
print(df)
print(f'Index type is now {df.index.dtype}')
df['mydatecol'] = pd.to_datetime(df['mydatecol'])
df.set_index('mydatecol', inplace=True)
print(df)
print(f'Index type is now {df.index.dtype}')
Output is
mydatecol othcol1 othcol2
0 06/11/2020 10 1
1 06/12/2020 20 2
2 06/13/2020 30 3
3 06/14/2020 40 4
Index type is now int64
othcol1 othcol2
mydatecol
2020-06-11 10 1
2020-06-12 20 2
2020-06-13 30 3
2020-06-14 40 4
Index type is now datetime64[ns]
I found a quick solution to my problem. You can create a new pandas column based on the index and then use datetime to reformat the date.
df['date'] = df.index # Creates new column called 'date' of type Timestamp
df['date'] = df['date'].dt.strftime('%m/%d/%Y %I:%M%p') # Date formatting
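Since the stated goal was to push the date and the time into two separate columns, here is a small sketch of one way to do that (the column names 'date' and 'time' are just illustrative):

```python
import pandas as pd

df = pd.DataFrame({'val': [1, 2]},
                  index=pd.to_datetime(['2020-06-11 09:30', '2020-06-12 14:45']))
# a Series built from the index gains the .dt accessor
ts = df.index.to_series()
df['date'] = ts.dt.strftime('%m/%d/%Y').values  # date part only
df['time'] = ts.dt.strftime('%I:%M%p').values   # 12-hour time with AM/PM
print(df)
```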

Date comparison in python

I already have two datasets, each with two columns (date, close).
I want to compare the dates of the first dataset to the dates of the second: if a date exists in both, the close of the second dataset takes the value for that date; otherwise it takes the close of the previous day.
This is the dataset https://www.euronext.com/fr/products/equities/FR0000120644-XPAR
https://fr.finance.yahoo.com/quote/%5EFCHI/history?period1=852105600&period2=1528873200&interval=1d&filter=history&frequency=1d
This is my code:
import numpy as np
from datetime import datetime , timedelta
import pandas as pd
#import cac 40 stock index (dataset1)
df = pd.read_csv('cac 40.csv')
df = pd.DataFrame(df)
#import Danone index(dataset2)
df1 = pd.read_excel('Price_Data_Danone.xlsx',header=3)
df1 = pd.DataFrame(df1)
#check the number of observation of both datasets and get the minimum number
if len(df1)>len(df):
    size=len(df)
elif len(df1)<len(df):
    size=len(df1)
else:
    size=len(df)
#get new close values of dataset2 relative to the date in dataset1
close1=np.zeros((size))
for i in range(0,size,1):
    # find the date of dataset1 in dataset 2
    if (df['Date'][i] in df1['Date']):
        #get the index of the date and the corresponding value of close and store it in close1
        close1[i]=df['close'][df1.loc['Date'][i], df['Date']]
    else:
        #if the date doesn't exist in dataset2,
        #take the value of close for the previous date of dataset1
        close1[i]=df['close'][df1.loc['Date'][i-1], df['Date']]
This is my trail, i got this error :
KeyError: 'the label [Date] is not in the [index]'
Examples:
we look for the value df['Date'][1] = '5/06/2009' in the column df1['Date']
we get its index in df1['Date']
then close1=df1['close'][index]
else if df['Date'][1] = '5/06/2009' not in df1['Date']
we get the index of the previous date df['Date'][0] = '4/06/2009'
close1=df1['close'][previous index]
Your error happens in line:
close1[i]=df['close'][df1.loc['Date'][i], df['Date']]
If your goal here is to get close value from df given i index you should write:
close[i] = df['close'][i]
See if that helps; unfortunately I don't fully understand what you are trying to accomplish, for example why do you set size to the length of the shorter table?
Also, as long as I downloaded the correct files, your condition df['Date'][i] in df1['Date'] might not work: one date format uses - and the other /.
Solution
import pandas as pd
pd.set_option('expand_frame_repr', False)
# load both files
df = pd.read_csv('CAC.csv')
df1 = pd.read_csv('DANONE.csv', header=3)
# ensure date format is the same between two
df.Date = pd.to_datetime(df.Date, dayfirst=True)
df1.Date = pd.to_datetime(df1.Date, dayfirst=True)
# you need only Date and Close columns as far as I understand
keep_columns = ['Date', 'Close']
# let's keep only these columns then
df = df[keep_columns]
df1 = df1[keep_columns]
# merge two tables on Date, method is left so that for every row in df we
# 'append' row from df1 if possible, if not there will be NaN value,
# for readability I added suffixes df - CAC and df1 - DANONE
merged = pd.merge(df,
                  df1,
                  on='Date',
                  how='left',
                  suffixes=['CAC', 'DANONE'])
# now for all missing values in CloseDANONE, so if there is Date in df
# but not in df1 we fill this value with LAST available
merged.CloseDANONE.fillna(method='ffill', inplace=True)
# we get values from CloseDANONE column as long as it's not null
close1 = merged.loc[merged.CloseDANONE.notnull(), 'CloseDANONE'].values
Below you can see:
last 6 values from df - CAC
Date Close
5522 2018-06-06 5457.560059
5523 2018-06-07 5448.359863
5524 2018-06-08 5450.220215
5525 2018-06-11 5473.910156
5526 2018-06-12 5453.370117
5527 2018-06-13 5468.240234
last rows from df1 - DANONE:
Date Close
0 2018-06-06 63.86
1 2018-06-07 63.71
2 2018-06-08 64.31
3 2018-06-11 64.91
4 2018-06-12 65.43
last 6 rows from merged:
Date CloseCAC CloseDANONE
5522 2018-06-06 5457.560059 63.86
5523 2018-06-07 5448.359863 63.71
5524 2018-06-08 5450.220215 64.31
5525 2018-06-11 5473.910156 64.91
5526 2018-06-12 5453.370117 65.43
5527 2018-06-13 5468.240234 65.43
For every value that was present in df we get value from df1, but 2018-06-13 is not present in df1 so I fill it with last available value which is 65.43 from 2018-06-12.
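Since the original CSV files aren't available here, the same merge-and-ffill pattern can be checked on a tiny synthetic pair of tables (the dates and closes below are made up to mirror the tail of the real data):

```python
import pandas as pd

cac = pd.DataFrame({'Date': pd.to_datetime(['2018-06-11', '2018-06-12', '2018-06-13']),
                    'Close': [5473.91, 5453.37, 5468.24]})
danone = pd.DataFrame({'Date': pd.to_datetime(['2018-06-11', '2018-06-12']),
                       'Close': [64.91, 65.43]})
# left-merge on Date: 2018-06-13 has no DANONE row, so it starts as NaN
merged = pd.merge(cac, danone, on='Date', how='left', suffixes=['CAC', 'DANONE'])
# forward-fill so the missing day carries the last available close
merged['CloseDANONE'] = merged['CloseDANONE'].ffill()
print(merged)
```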

Python data-frame using pandas

I have a dataset which looks like below
[25/May/2015:23:11:15 000]
[25/May/2015:23:11:15 000]
[25/May/2015:23:11:16 000]
[25/May/2015:23:11:16 000]
Now I have made this into a DataFrame: df[0] has [25/May/2015:23:11:15 and df[1] has 000]. I want to send all the rows that end with the same seconds value to a file; in the above example they end with 15 and 16 seconds. So all rows ending at 15 seconds go into one file, all ending at 16 into a different one, and so on.
I have tried the below code
import pandas as pd
data = pd.read_csv('apache-access-log.txt', sep=" ", header=None)
df = pd.DataFrame(data)
print(df[0],df[1].str[-2:])
Converting that column to a datetime would make it easier to work with, e.g.:
df['date'] = pd.to_datetime(df['date'], format='%d/%B/%Y:%H:%M:%S')
(Note %M for minutes; lowercase %m would try to parse a month there.)
Then you can simply iterate over a groupby(), e.g.:
In []:
for k, frame in df.groupby(df['date'].dt.second):
    #frame.to_csv('file{}.csv'.format(k))
    print('{}\n{}\n'.format(k, frame))
Out[]:
15
                 date  value
0 2015-05-25 23:11:15      0
1 2015-05-25 23:11:15      0

16
                 date  value
2 2015-05-25 23:11:16      0
3 2015-05-25 23:11:16      0
You can set your datetime as the index of the dataframe, and then use Pandas' loc and to_csv functions. Obviously, as other answers point out, you should convert your date column to datetime while reading your dataframe.
Example:
df = df.set_index(['date'])
df.loc['2015-05-25 23:11:15':'2015-05-25 23:11:15'].to_csv('df_data.csv')
Try this:
## Create a new column with the seconds value
df['seconds'] = df.apply(lambda row: row[0].split(":")[3].split(" ")[0], axis=1)
for sec in df['seconds'].unique():
    ## filter by seconds
    print("Result ", df[df['seconds'] == sec])
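The same idea, with the timestamps parsed into real datetimes first (assuming the bracketed log format shown in the question; each group could then be written to its own file):

```python
import pandas as pd

raw = ['[25/May/2015:23:11:15 000]',
       '[25/May/2015:23:11:16 000]',
       '[25/May/2015:23:11:15 000]']
df = pd.DataFrame({'raw': raw})
# strip the brackets, keep the timestamp part before the space, then parse it
ts = df['raw'].str.strip('[]').str.split(' ').str[0]
df['date'] = pd.to_datetime(ts, format='%d/%b/%Y:%H:%M:%S')
# one group (and potentially one output file) per distinct seconds value
groups = {sec: frame for sec, frame in df.groupby(df['date'].dt.second)}
```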

Add missing dates to pandas dataframe

My data can have multiple events on a given date or NO events on a date. I take these events, get a count by date and plot them. However, when I plot them, my two series don't always match.
idx = pd.date_range(df['simpleDate'].min(), df['simpleDate'].max())
s = df.groupby(['simpleDate']).size()
In the above code idx becomes a range of say 30 dates. 09-01-2013 to 09-30-2013
However, s may only have 25 or 26 days because no events happened on a given date. I then get an AssertionError because the sizes don't match when I try to plot:
fig, ax = plt.subplots()
ax.bar(idx.to_pydatetime(), s, color='green')
What's the proper way to tackle this? Do I want to remove dates with no values from idx, or (which I'd rather do) add the missing dates to the series with a count of 0? I'd rather have a full graph of 30 days with 0 values. If this approach is right, any suggestions on how to get started? Do I need some sort of dynamic reindex function?
Here's a snippet of S ( df.groupby(['simpleDate']).size() ), notice no entries for 04 and 05.
09-02-2013 2
09-03-2013 10
09-06-2013 5
09-07-2013 1
You could use Series.reindex:
import pandas as pd
idx = pd.date_range('09-01-2013', '09-30-2013')
s = pd.Series({'09-02-2013': 2,
               '09-03-2013': 10,
               '09-06-2013': 5,
               '09-07-2013': 1})
s.index = pd.DatetimeIndex(s.index)
s = s.reindex(idx, fill_value=0)
print(s)
yields
2013-09-01 0
2013-09-02 2
2013-09-03 10
2013-09-04 0
2013-09-05 0
2013-09-06 5
2013-09-07 1
2013-09-08 0
...
A quicker workaround is to use .asfreq(). This doesn't require creation of a new index to call within .reindex().
# "broken" (staggered) dates
dates = pd.Index([pd.Timestamp('2012-05-01'),
                  pd.Timestamp('2012-05-04'),
                  pd.Timestamp('2012-05-06')])
s = pd.Series([1, 2, 3], dates)
print(s.asfreq('D'))
2012-05-01 1.0
2012-05-02 NaN
2012-05-03 NaN
2012-05-04 2.0
2012-05-05 NaN
2012-05-06 3.0
Freq: D, dtype: float64
One issue is that reindex will fail if there are duplicate values. Say we're working with timestamped data, which we want to index by date:
df = pd.DataFrame({
    'timestamps': pd.to_datetime(
        ['2016-11-15 1:00', '2016-11-16 2:00', '2016-11-16 3:00', '2016-11-18 4:00']),
    'values': ['a', 'b', 'c', 'd']})
df.index = pd.DatetimeIndex(df['timestamps']).floor('D')
df
yields
timestamps values
2016-11-15 "2016-11-15 01:00:00" a
2016-11-16 "2016-11-16 02:00:00" b
2016-11-16 "2016-11-16 03:00:00" c
2016-11-18 "2016-11-18 04:00:00" d
Due to the duplicate 2016-11-16 date, an attempt to reindex:
all_days = pd.date_range(df.index.min(), df.index.max(), freq='D')
df.reindex(all_days)
fails with:
...
ValueError: cannot reindex from a duplicate axis
(by this it means the index has duplicates, not that it is itself a dup)
Instead, we can use .loc to look up entries for all dates in range:
df.loc[all_days]
yields
timestamps values
2016-11-15 "2016-11-15 01:00:00" a
2016-11-16 "2016-11-16 02:00:00" b
2016-11-16 "2016-11-16 03:00:00" c
2016-11-17 NaN NaN
2016-11-18 "2016-11-18 04:00:00" d
fillna can be used on the column series to fill blanks if needed.
An alternative approach is resample, which can handle duplicate dates in addition to missing dates. For example:
df.resample('D').mean()
resample is a deferred operation like groupby so you need to follow it with another operation. In this case mean works well, but you can also use many other pandas methods like max, sum, etc.
Here is the original data, but with an extra entry for '2013-09-03':
val
date
2013-09-02 2
2013-09-03 10
2013-09-03 20 <- duplicate date added to OP's data
2013-09-06 5
2013-09-07 1
And here are the results:
val
date
2013-09-02 2.0
2013-09-03 15.0 <- mean of original values for 2013-09-03
2013-09-04 NaN <- NaN b/c date not present in orig
2013-09-05 NaN <- NaN b/c date not present in orig
2013-09-06 5.0
2013-09-07 1.0
I left the missing dates as NaNs to make it clear how this works, but you can add fillna(0) to replace NaNs with zeroes as requested by the OP or alternatively use something like interpolate() to fill with non-zero values based on the neighboring rows.
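A quick runnable check of the duplicate-date behaviour described above, with fillna(0) applied as the OP wanted:

```python
import pandas as pd

# the OP's counts, plus the duplicate 2013-09-03 entry of 20
s = pd.Series([2, 10, 20, 5, 1],
              index=pd.to_datetime(['2013-09-02', '2013-09-03', '2013-09-03',
                                    '2013-09-06', '2013-09-07']))
# duplicates are averaged; missing days become NaN, then 0
out = s.resample('D').mean().fillna(0)
print(out)
```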
Here's a nice method to fill in missing dates into a dataframe, with your choice of fill_value, days_back to fill in, and sort order (date_order) by which to sort the dataframe:
from datetime import datetime, timedelta
import pandas as pd

def fill_in_missing_dates(df, date_col_name='date', date_order='asc', fill_value=0, days_back=30):
    df.set_index(date_col_name, drop=True, inplace=True)
    df.index = pd.DatetimeIndex(df.index)
    d = datetime.now().date()
    d2 = d - timedelta(days=days_back)
    idx = pd.date_range(d2, d, freq="D")
    df = df.reindex(idx, fill_value=fill_value)
    df[date_col_name] = pd.DatetimeIndex(df.index)
    return df
You can always just use DataFrame.merge() utilizing a left join from an 'All Dates' DataFrame to the 'Missing Dates' DataFrame. Example below.
# example DataFrame with missing dates between min(date) and max(date)
missing_df = pd.DataFrame({
    'date': pd.to_datetime([
        '2022-02-10',
        '2022-02-11',
        '2022-02-14',
        '2022-02-14',
        '2022-02-24',
        '2022-02-16'
    ]),
    'value': [10, 20, 5, 10, 15, 30]
})
# first create a DataFrame with all dates between specified start<-->end using pd.date_range()
all_dates = pd.DataFrame(pd.date_range(missing_df['date'].min(), missing_df['date'].max()), columns=['date'])
# from the all_dates DataFrame, left join onto the DataFrame with missing dates
new_df = all_dates.merge(right=missing_df, how='left', on='date')
Or, as a one-liner: upsample to daily frequency, interpolate, and then take the quarter-end values:
s.asfreq('D').interpolate().asfreq('Q')
