I have the following DataFrame with a Date column,
0 2021-12-13
1 2021-12-10
2 2021-12-09
3 2021-12-08
4 2021-12-07
...
7990 1990-01-08
7991 1990-01-05
7992 1990-01-04
7993 1990-01-03
7994 1990-01-02
I am trying to find the index for a specific date in this DataFrame using the following code,
# import raw data into DataFrame
df = pd.DataFrame.from_records(data['dataset']['data'])
df.columns = data['dataset']['column_names']
df['Date'] = pd.to_datetime(df['Date'])
# sample date to search for
sample_date = dt.date(2021,12,13)
print(sample_date)
# return index of sample date
date_index = df.index[df['Date'] == sample_date].tolist()
print(date_index)
The output of the program is,
2021-12-13
[]
I can't understand why. I have cast the Date column in the DataFrame to a DateTime and I'm doing a like-for-like comparison.
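I reproduced your DataFrame with a minimal sample. The column holds datetime64 values, and recent pandas versions do not treat a plain dt.date as equal to them, so the comparison finds nothing. Comparing against a datetime works, like this: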
import pandas as pd
import datetime as dt
df = pd.DataFrame({'Date':['2021-12-13','2021-12-10','2021-12-09','2021-12-08']})
df['Date'] = pd.to_datetime(df['Date'].astype(str), format='%Y-%m-%d')
sample_date = dt.datetime.strptime('2021-12-13', '%Y-%m-%d')
date_index = df.index[df['Date'] == sample_date].tolist()
print(date_index)
output:
[0]
The sample date is found at index 0 of the DataFrame.
Please let me know if this one has any issues
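If the search value has to stay a dt.date, one option is to wrap it in pd.Timestamp before comparing. This is a minimal sketch on made-up sample rows, not the original dataset:

```python
import datetime as dt
import pandas as pd

# Small stand-in for the real data (assumed sample values)
df = pd.DataFrame({'Date': pd.to_datetime(['2021-12-13', '2021-12-10', '2021-12-09'])})

sample_date = dt.date(2021, 12, 13)

# pd.Timestamp turns the date into a midnight timestamp that matches the column dtype
date_index = df.index[df['Date'] == pd.Timestamp(sample_date)].tolist()
print(date_index)
```

This only matches rows whose time component is midnight, which holds when the column was parsed from date-only strings.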
Related
I want to create a function that counts the days as an integer between a date and the date shifted back a number of periods (e.g. df['new_col'] = df['date'].shift(periods) - df['date']). The date variable is datetime64[D].
As an example: df['report_date'].shift(39) = '2008-09-26' and df['report_date'] = '2008-08-18' and df['delta'] = 39.
import numpy as np
import pandas as pd
# np.tile repeats each 4-date pattern 4 times; pd.to_datetime yields datetime64 values
dates = pd.Series(pd.to_datetime(np.tile(['2012-08-01','2012-08-15','2012-09-01','2012-08-15'],4)))
dates2 = pd.Series(pd.to_datetime(np.tile(['2012-08-01','2012-09-01','2012-10-01','2012-11-01'],4)))
stocks = ['A','A','A','A','G','G','G','G','B','B','B','B','F','F','F','F']
stocks = pd.Series(stocks)
df = pd.DataFrame(dict(stocks = stocks, dates = dates,report_date = dates2)).reset_index()
print(df.head())
# info() prints its summary directly and returns None
df.info()
The code below is my latest attempt to create this variable, but the code produces incorrect results.
df['delta'] = df.groupby(['stocks','dates'])['report_date'].transform(lambda x: (x.shift(1).rsub(x).dt.days))
I came up with the solution of using a for loop and zip function, to simply subtract each pair like so...
from datetime import datetime
import pandas as pd
dates = ['2012-08-01', '2012-08-15', '2012-09-01', '2012-08-15']
dates2 = ['2012-08-01', '2012-09-01', '2012-10-01', '2012-11-01']
diff = []
for i, x in zip(dates, dates2):
    i = datetime.strptime(i, '%Y-%m-%d')
    x = datetime.strptime(x, '%Y-%m-%d')
    diff.append(i - x)
df = {'--col1--': dates, '--col2--': dates2, '--difference--': diff}
df = pd.DataFrame(df)
print(df)
Output:
--col1-- --col2-- --difference--
0 2012-08-01 2012-08-01 0 days
1 2012-08-15 2012-09-01 -17 days
2 2012-09-01 2012-10-01 -30 days
3 2012-08-15 2012-11-01 -78 days
I hope that solves your problem.
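The same subtraction can also be done without a loop once both columns are datetimes. The sketch below reuses the dates from the answer; the column names col1, col2, difference and delta_prev are placeholders, and the shift(1) line is only one possible reading of "shifted back a number of periods":

```python
import pandas as pd

df = pd.DataFrame({
    'col1': pd.to_datetime(['2012-08-01', '2012-08-15', '2012-09-01', '2012-08-15']),
    'col2': pd.to_datetime(['2012-08-01', '2012-09-01', '2012-10-01', '2012-11-01']),
})

# Column-wise subtraction gives timedeltas; .dt.days extracts whole days
df['difference'] = (df['col1'] - df['col2']).dt.days

# Days between each col2 value and the value one row earlier (first row has no predecessor)
df['delta_prev'] = (df['col2'] - df['col2'].shift(1)).dt.days

print(df)
```

To apply the shift within each stock rather than across the whole frame, the same expression could be wrapped in a groupby on the stock column.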
Hi, I am looking for a more elegant solution than my code. I have a given df which looks like this:
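import numpy as np
import pandas as pd
from datetime import date, timedelta
from pandas.tseries.offsets import DateOffset
sdate = date(2021,1,31)
edate = date(2021,8,30)
# month-end frequency; spelled 'ME' in pandas 2.2 and later
date_range = pd.date_range(sdate,edate-timedelta(days=1),freq='M')
df_test = pd.DataFrame({ 'Datum': date_range})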
I take this df and have to insert a new first row dated one month before the minimum date:
data_perf_indexed_vv = df_test.copy()
minimum_date = df_test['Datum'].min()
data_perf_indexed_vv = data_perf_indexed_vv.reset_index()
df1 = pd.DataFrame([[np.nan] * len(data_perf_indexed_vv.columns)],
columns=data_perf_indexed_vv.columns)
data_perf_indexed_vv = df1.append(data_perf_indexed_vv, ignore_index=True)
data_perf_indexed_vv['Datum'].iloc[0] = minimum_date - DateOffset(months=1)
data_perf_indexed_vv.drop(['index'], axis=1)
Does anybody have a shorter or more elegant solution? Thanks.
Instead of writing such a big second block of code, just use:
df_test.loc[len(df_test)+1,'Datum']=(df_test['Datum'].min()-DateOffset(months=1))
Finally make use of sort_values() method:
df_test=df_test.sort_values(by='Datum',ignore_index=True)
Now if you print df_test you will get desired output:
#output
Datum
0 2020-12-31
1 2021-01-31
2 2021-02-28
3 2021-03-31
4 2021-04-30
5 2021-05-31
6 2021-06-30
7 2021-07-31
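Another way to get the same result, without assigning to a new label and sorting afterwards, is to build the extra row as its own one-row DataFrame and put it in front with pd.concat. A minimal sketch, using a shortened df_test as stand-in data; first_row and result are names made up for the example:

```python
import pandas as pd
from pandas.tseries.offsets import DateOffset

# Shortened stand-in for df_test
df_test = pd.DataFrame({'Datum': pd.to_datetime(['2021-01-31', '2021-02-28', '2021-03-31'])})

# One-row frame holding the date one month before the current minimum
first_row = pd.DataFrame({'Datum': [df_test['Datum'].min() - DateOffset(months=1)]})

# Prepend it and renumber the index from 0
result = pd.concat([first_row, df_test], ignore_index=True)
print(result)
```

Because the new row is placed first explicitly, this does not rely on the column being sortable into the desired order.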
I'm trying to create a new date column based on an existing date column in my dataframe. I want to take all the dates in the first column and make them the first of the month in the second column so:
03/15/2019 = 03/01/2019
I know I can do this using:
df['newcolumn'] = pd.to_datetime(df['oldcolumn'], format='%Y-%m-%d').apply(lambda dt: dt.replace(day=1)).dt.date
My issue is that some of the data in the old column is not valid dates. There is some text data in some of the rows. So, I'm trying to figure out how to either clean up the data before I do this like:
if oldcolumn isn't a date then make it 01/01/1990 else oldcolumn
Or, is there a way to do this with try/except?
Any assistance would be appreciated.
First, we generate some sample data:
df = pd.DataFrame([['2019-01-03'], ['asdf'], ['2019-11-10']], columns=['Date'])
This can be safely converted to datetime
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
mask = df['Date'].isnull()
df.loc[mask, 'Date'] = dt.datetime(1990, 1, 1)
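Now you don't need the slow apply. Converting through a monthly period truncates each date to the first of its month (adding pd.offsets.MonthBegin(-1) would instead push dates that already fall on the 1st, such as the 1990-01-01 fill value, back to the previous month):
df['New'] = df['Date'].dt.to_period('M').dt.to_timestamp()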
Try with the argument errors='coerce'.
This will return NaT for the text values.
df['newcolumn'] = pd.to_datetime(df['oldcolumn'],
format='%Y-%m-%d',
errors='coerce').apply(lambda dt: dt.replace(day=1)).dt.date
For example
# We have this dataframe
ID Date
0 111 03/15/2019
1 133 01/01/2019
2 948 Empty
3 452 02/10/2019
# We convert Date column to datetime
df['Date'] = pd.to_datetime(df.Date, format='%m/%d/%Y', errors='coerce')
Output
ID Date
0 111 2019-03-15
1 133 2019-01-01
2 948 NaT
3 452 2019-02-10
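To finish the task from the question on this example, the NaT rows can be replaced with the 01/01/1990 fallback and every date moved to the first of its month. This is a sketch under the assumption that the fallback date is what the asker wants for invalid rows; newcolumn is the column name from the question:

```python
import pandas as pd

df = pd.DataFrame({'ID': [111, 133, 948, 452],
                   'Date': ['03/15/2019', '01/01/2019', 'Empty', '02/10/2019']})

# Invalid strings become NaT instead of raising
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y', errors='coerce')

# Replace NaT with the fallback date from the question
df['Date'] = df['Date'].fillna(pd.Timestamp('1990-01-01'))

# Truncate to the first day of the month via a monthly period
df['newcolumn'] = df['Date'].dt.to_period('M').dt.to_timestamp()
print(df)
```

Going through to_period('M') leaves dates that are already on the 1st unchanged, which a negative month offset would not.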
I already have two datasets; each one has 2 columns (date, close).
I want to compare the dates of the first dataset to the dates of the second dataset. If a date appears in both, close takes the second dataset's value for that date; otherwise it takes the value from the previous day.
These are the datasets: https://www.euronext.com/fr/products/equities/FR0000120644-XPAR
https://fr.finance.yahoo.com/quote/%5EFCHI/history?period1=852105600&period2=1528873200&interval=1d&filter=history&frequency=1d
This is my code:
import numpy as np
from datetime import datetime , timedelta
import pandas as pd
#import cac 40 stock index (dataset1)
df = pd.read_csv('cac 40.csv')
df = pd.DataFrame(df)
#import Danone index(dataset2)
df1 = pd.read_excel('Price_Data_Danone.xlsx',header=3)
df1 = pd.DataFrame(df1)
#check the number of observation of both datasets and get the minimum number
if len(df1)>len(df):
    size=len(df)
elif len(df1)<len(df):
    size=len(df1)
else:
    size=len(df)
#get new close values of dataset2 relative to the date in datset1
close1=np.zeros((size))
for i in range(0,size,1):
    # find the date of dataset1 in dataset 2
    if (df['Date'][i]in df1['Date']):
        #get the index of the date and the corresponding value of close and store it in close1
        close1[i]=df['close'][df1.loc['Date'][i], df['Date']]
    else:
        #if the date doesen't exist in datset2
        #take value of close of previous date of datatset1
        close1[i]=df['close'][df1.loc['Date'][i-1], df['Date']]
This is my attempt; I got this error:
KeyError: 'the label [Date] is not in the [index]'
Examples:
we look for the value df['Date'][1] = '5/06/2009' in the column df1['Date']
we get its index in df1['Date']
then close1=df1['close'][index]
else if df['Date'][1] = '5/06/2009' not in df1['Date']
we get the index of the previous date df['Date'][0] = '4/06/2009'
close1=df1['close'][previous index]
Your error happens in line:
close1[i]=df['close'][df1.loc['Date'][i], df['Date']]
If your goal here is to get the close value from df at index i, you should write:
close1[i] = df['close'][i]
See if that helps. Unfortunately I don't fully understand what you are trying to accomplish; for example, why do you set size to the length of the shorter table?
Also, assuming I downloaded the correct files, your condition df['Date'][i] in df1['Date'] might not work: one date format uses - and the other /, and in on a Series tests the index labels rather than the values.
Solution
import pandas as pd
pd.set_option('expand_frame_repr', False)
# load both files
df = pd.read_csv('CAC.csv')
df1 = pd.read_csv('DANONE.csv', header=3)
# ensure date format is the same between two
df.Date = pd.to_datetime(df.Date, dayfirst=True)
df1.Date = pd.to_datetime(df1.Date, dayfirst=True)
# you need only Date and Close columns as far as I understand
keep_columns = ['Date', 'Close']
# let's keep only these columns then
df = df[keep_columns]
df1 = df1[keep_columns]
# merge two tables on Date, method is left so that for every row in df we
# 'append' row from df1 if possible, if not there will be NaN value,
# for readability I added suffixes df - CAC and df1 - DANONE
merged = pd.merge(df,
                  df1,
                  on='Date',
                  how='left',
                  suffixes=['CAC', 'DANONE'])
# now for all missing values in CloseDANONE, so if there is Date in df
# but not in df1 we fill this value with LAST available
merged['CloseDANONE'] = merged['CloseDANONE'].ffill()
# we get values from CloseDANONE column as long as it's not null
close1 = merged.loc[merged.CloseDANONE.notnull(), 'CloseDANONE'].values
Below you can see:
last 6 values from df - CAC
Date Close
5522 2018-06-06 5457.560059
5523 2018-06-07 5448.359863
5524 2018-06-08 5450.220215
5525 2018-06-11 5473.910156
5526 2018-06-12 5453.370117
5527 2018-06-13 5468.240234
last 5 values from df1 - DANONE:
Date Close
0 2018-06-06 63.86
1 2018-06-07 63.71
2 2018-06-08 64.31
3 2018-06-11 64.91
4 2018-06-12 65.43
last 6 rows from merged:
Date CloseCAC CloseDANONE
5522 2018-06-06 5457.560059 63.86
5523 2018-06-07 5448.359863 63.71
5524 2018-06-08 5450.220215 64.31
5525 2018-06-11 5473.910156 64.91
5526 2018-06-12 5453.370117 65.43
5527 2018-06-13 5468.240234 65.43
For every date present in df we get the matching value from df1, but 2018-06-13 is not present in df1, so it is filled with the last available value, which is 65.43 from 2018-06-12.
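The "take the last available value" step can also be written in one call with pd.merge_asof, which is a different technique from the left merge plus forward fill used above. A sketch on a few rows copied from the tables above (rounded), with cac and danone as stand-in names for df and df1:

```python
import pandas as pd

cac = pd.DataFrame({
    'Date': pd.to_datetime(['2018-06-11', '2018-06-12', '2018-06-13']),
    'Close': [5473.91, 5453.37, 5468.24],
})
danone = pd.DataFrame({
    'Date': pd.to_datetime(['2018-06-11', '2018-06-12']),
    'Close': [64.91, 65.43],
})

# merge_asof needs both frames sorted by the key; by default it matches each
# left row to the most recent right row whose Date is not later
merged = pd.merge_asof(cac.sort_values('Date'),
                       danone.sort_values('Date'),
                       on='Date',
                       suffixes=('CAC', 'DANONE'))
print(merged)
```

Unlike the merge followed by a forward fill, this looks up the last earlier date in the second table directly, so the result does not depend on which dates happen to exist in the first table.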
I have the following python pandas dataframe df:
DATES Sales
0 1/6/2013 5676
1 1/8/2014 45746
2 1/10/2015 42658
3 1/14/2015 890790
4 1/16/2016 5764
5 1/20/2014 7898
I need to change DATES to a DatetimeIndex, so that I can resample it.
But when I do this
pd.to_datetime(df,infer_datetime_format=True)
I get the following error:
ValueError: to assemble mappings requires at least that [year, month, day] be specified: [day,month,year] is missing
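pd.to_datetime was given the whole DataFrame, so it tried to assemble dates from columns named year, month and day. Pass only the DATES column and define the format explicitly instead of letting pandas guess (see the to_datetime() documentation):
pd.to_datetime(df['DATES'],format='%m/%d/%Y')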
To set a datetime as an index
df = df.set_index(pd.DatetimeIndex(df['DATES']))
Works for non-padded month and day:
import pandas as pd
d = {'1/6/2013' : 5676}
df = pd.DataFrame(list(d.items()), columns=['DATES', 'Sales'])
df['DATES'] = pd.to_datetime(df['DATES'],format='%m/%d/%Y')
0 2013-01-06
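Putting the pieces together for the resampling the question asks about, here is a sketch on the sample data from the question. The monthly 'MS' (month start) frequency is an arbitrary choice for illustration, and monthly is a name made up for the example:

```python
import pandas as pd

df = pd.DataFrame({
    'DATES': ['1/6/2013', '1/8/2014', '1/10/2015', '1/14/2015', '1/16/2016', '1/20/2014'],
    'Sales': [5676, 45746, 42658, 890790, 5764, 7898],
})

# Parse the non-padded month/day strings with an explicit format
df['DATES'] = pd.to_datetime(df['DATES'], format='%m/%d/%Y')

# A sorted DatetimeIndex is what resample operates on
df = df.set_index('DATES').sort_index()

# Total sales per calendar month; months without rows sum to 0
monthly = df['Sales'].resample('MS').sum()
print(monthly[monthly > 0])
```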