I have two datasets, each with two columns (date, close).
I want to compare the dates of the first dataset with the dates of the second. If a date exists in both, close takes the second dataset's value for that date; otherwise it takes the value from the previous day.
These are the datasets: https://www.euronext.com/fr/products/equities/FR0000120644-XPAR
https://fr.finance.yahoo.com/quote/%5EFCHI/history?period1=852105600&period2=1528873200&interval=1d&filter=history&frequency=1d
This is my code:
import numpy as np
from datetime import datetime, timedelta
import pandas as pd
# import CAC 40 stock index (dataset1)
df = pd.read_csv('cac 40.csv')
df = pd.DataFrame(df)
# import Danone index (dataset2)
df1 = pd.read_excel('Price_Data_Danone.xlsx',header=3)
df1 = pd.DataFrame(df1)
# check the number of observations in both datasets and get the minimum
if len(df1)>len(df):
    size=len(df)
elif len(df1)<len(df):
    size=len(df1)
else:
    size=len(df)
# get new close values of dataset2 relative to the dates in dataset1
close1=np.zeros((size))
for i in range(0,size,1):
    # find the date of dataset1 in dataset2
    if (df['Date'][i]in df1['Date']):
        # get the index of the date and the corresponding value of close and store it in close1
        close1[i]=df['close'][df1.loc['Date'][i], df['Date']]
    else:
        # if the date doesn't exist in dataset2,
        # take the close value of the previous date of dataset1
        close1[i]=df['close'][df1.loc['Date'][i-1], df['Date']]
This is my attempt. I got this error:
KeyError: 'the label [Date] is not in the [index]'
Example:
We look for the value df['Date'][1] = '5/06/2009' in the column df1['Date'].
If it is found, we get its index in df1['Date'],
then close1 = df1['close'][index].
Otherwise, if df['Date'][1] = '5/06/2009' is not in df1['Date'],
we get the index of the previous date, df['Date'][0] = '4/06/2009',
and close1 = df1['close'][previous index].
Your error happens on this line:
close1[i]=df['close'][df1.loc['Date'][i], df['Date']]
df1.loc['Date'] looks for a row labelled 'Date' in the index, which does not exist, hence the KeyError. If your goal here is to get the close value from df at index i, you should write:
close1[i] = df['close'][i]
See if that helps. Unfortunately I don't fully understand what you are trying to accomplish; for example, why do you set size to the length of the shorter table?
Also, assuming I downloaded the correct files, your condition df['Date'][i] in df1['Date'] might not work: one date format uses - and the other /.
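There is one more reason that condition can misbehave: the `in` operator on a pandas Series tests the index labels, not the values. A minimal sketch, using a hypothetical three-row Series in place of the real Date column:

```python
import pandas as pd

# Hypothetical miniature of a Date column (values are illustrative only).
dates = pd.Series(['4/06/2009', '5/06/2009', '8/06/2009'])

# `in` on a Series looks at the index labels (0, 1, 2), not at the values.
label_hit = 1 in dates                     # 1 is an index label
value_via_in = '5/06/2009' in dates        # the string is not an index label

# To test the values, use .values (or Series.isin for many values at once).
value_hit = '5/06/2009' in dates.values
mask = dates.isin(['5/06/2009', '6/06/2009'])

print(label_hit, value_via_in, value_hit, mask.tolist())
```

So even with matching date formats, the membership test should be made against the values rather than the Series itself.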
Solution
import pandas as pd
pd.set_option('expand_frame_repr', False)
# load both files
df = pd.read_csv('CAC.csv')
df1 = pd.read_csv('DANONE.csv', header=3)
# ensure date format is the same between two
df.Date = pd.to_datetime(df.Date, dayfirst=True)
df1.Date = pd.to_datetime(df1.Date, dayfirst=True)
# you need only Date and Close columns as far as I understand
keep_columns = ['Date', 'Close']
# let's keep only these columns then
df = df[keep_columns]
df1 = df1[keep_columns]
# merge the two tables on Date; how='left' means that for every row in df we
# 'append' the row from df1 if possible, otherwise there will be a NaN value;
# for readability I added suffixes: df - CAC and df1 - DANONE
merged = pd.merge(df,
                  df1,
                  on='Date',
                  how='left',
                  suffixes=['CAC', 'DANONE'])
# now for all missing values in CloseDANONE, i.e. a Date that is in df
# but not in df1, we fill the value with the LAST available one
merged['CloseDANONE'] = merged['CloseDANONE'].ffill()
# we get values from CloseDANONE column as long as it's not null
close1 = merged.loc[merged.CloseDANONE.notnull(), 'CloseDANONE'].values
Below you can see:
last 6 values from df - CAC
Date Close
5522 2018-06-06 5457.560059
5523 2018-06-07 5448.359863
5524 2018-06-08 5450.220215
5525 2018-06-11 5473.910156
5526 2018-06-12 5453.370117
5527 2018-06-13 5468.240234
last 5 values from df1 - DANONE:
Date Close
0 2018-06-06 63.86
1 2018-06-07 63.71
2 2018-06-08 64.31
3 2018-06-11 64.91
4 2018-06-12 65.43
last 6 rows from merged:
Date CloseCAC CloseDANONE
5522 2018-06-06 5457.560059 63.86
5523 2018-06-07 5448.359863 63.71
5524 2018-06-08 5450.220215 64.31
5525 2018-06-11 5473.910156 64.91
5526 2018-06-12 5453.370117 65.43
5527 2018-06-13 5468.240234 65.43
For every date present in df we take the value from df1, but 2018-06-13 is not present in df1, so I fill it with the last available value, which is 65.43 from 2018-06-12.
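As an alternative to merging and then forward-filling, pandas also offers `pd.merge_asof`, which matches each left row with the most recent right row at or before its key. The sketch below is not the method used in the answer above; it rebuilds only the last three CAC rows and last two DANONE rows from the tables shown, and assumes both frames are sorted by Date.

```python
import pandas as pd

# Last rows of the two tables shown above, rebuilt by hand.
cac = pd.DataFrame({
    'Date': pd.to_datetime(['2018-06-11', '2018-06-12', '2018-06-13']),
    'Close': [5473.910156, 5453.370117, 5468.240234],
})
danone = pd.DataFrame({
    'Date': pd.to_datetime(['2018-06-11', '2018-06-12']),
    'Close': [64.91, 65.43],
})

# merge_asof requires both frames to be sorted on the key column.
merged = pd.merge_asof(cac.sort_values('Date'),
                       danone.sort_values('Date'),
                       on='Date',
                       suffixes=('CAC', 'DANONE'))

close1 = merged['CloseDANONE'].to_numpy()
print(merged)
```

With the default backward direction, 2018-06-13 picks up the DANONE close from 2018-06-12, the same result as the merge plus forward fill.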
Related
I have the following DataFrame with a Date column,
0 2021-12-13
1 2021-12-10
2 2021-12-09
3 2021-12-08
4 2021-12-07
...
7990 1990-01-08
7991 1990-01-05
7992 1990-01-04
7993 1990-01-03
7994 1990-01-02
I am trying to find the index for a specific date in this DataFrame using the following code,
# import raw data into DataFrame
df = pd.DataFrame.from_records(data['dataset']['data'])
df.columns = data['dataset']['column_names']
df['Date'] = pd.to_datetime(df['Date'])
# sample date to search for
sample_date = dt.date(2021,12,13)
print(sample_date)
# return index of sample date
date_index = df.index[df['Date'] == sample_date].tolist()
print(date_index)
The output of the program is,
2021-12-13
[]
I can't understand why. I have cast the Date column in the DataFrame to a DateTime and I'm doing a like-for-like comparison.
I have reproduced your DataFrame with a minimal sample. The comparison works if you compare against a datetime object (rather than a date object), like this:
import pandas as pd
import datetime as dt
df = pd.DataFrame({'Date':['2021-12-13','2021-12-10','2021-12-09','2021-12-08']})
df['Date'] = pd.to_datetime(df['Date'].astype(str), format='%Y-%m-%d')
sample_date = dt.datetime.strptime('2021-12-13', '%Y-%m-%d')
date_index = df.index[df['Date'] == sample_date].tolist()
print(date_index)
output:
[0]
The date searched for is at index 0 of the DataFrame.
Please let me know if this has any issues.
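The likely cause of the empty result is that the column holds datetime64 values while `dt.date(...)` is a plain date, so the two are not compared like for like. A short sketch of two ways to keep the original `sample_date` and still get a match, using a three-row stand-in for the real data:

```python
import datetime as dt
import pandas as pd

# Three-row stand-in for the Date column from the question.
df = pd.DataFrame({'Date': pd.to_datetime(['2021-12-13', '2021-12-10', '2021-12-09'])})
sample_date = dt.date(2021, 12, 13)

# Option 1: promote the date to a pandas Timestamp before comparing.
idx_ts = df.index[df['Date'] == pd.Timestamp(sample_date)].tolist()

# Option 2: compare date objects with date objects via the .dt accessor.
idx_date = df.index[df['Date'].dt.date == sample_date].tolist()

print(idx_ts, idx_date)
```

Option 1 keeps the comparison vectorised on the datetime64 column; option 2 builds an object column of dates first, which is usually slower on large frames.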
I would like to filter for customer_ids that first appear after a certain date, in this case 2019-01-10, and then create a new df with a list of new customers.
df
date customer_id
2019-01-01 429492
2019-01-01 344343
2019-01-01 949222
2019-01-10 429492
2019-01-10 344343
2019-01-10 129292
Output df
customer_id
129292
This is what I have tried so far but this gives me also customer_id's that were active before 10th January 2019
s = df.loc[df["date"]>="2019-01-10", "customer_id"]
df_new = df[df["customer_id"].isin(s)]
df_new
You can use boolean indexing, filtering with Series.isin:
df["date"] = pd.to_datetime(df["date"])
mask1 = df["date"]>="2019-01-10"
mask2 = df["customer_id"].isin(df.loc[~mask1,"customer_id"])
df = df.loc[mask1 & ~mask2, ['customer_id']]
print (df)
customer_id
5 129292
df['date'] = pd.to_datetime(df['date'])
cutoff = pd.to_datetime('2019-01-10')
mask = df['date'] >= cutoff
customers_before = df.loc[~mask, 'customer_id'].unique().tolist()
customers_after = df.loc[mask, 'customer_id'].unique().tolist()
result = set(customers_after) - set(customers_before)
You wrote "then create a new df with a list of new customers". Taken literally, your output is empty, because 2019-01-10 is the last date and there are no new customers after it.
But if you want the list of customers on or after a certain date:
df = pd.DataFrame({
    'date': ['2019-01-01', '2019-01-01', '2019-01-01',
             '2019-01-10', '2019-01-10', '2019-01-10'],
    'customer_id': [429492, 344343, 949222, 429492, 344343, 129292]
})
certain_date = pd.to_datetime('2019-01-10')
df.date = pd.to_datetime(df.date)
df = df[df.date >= certain_date]
print(df)
date customer_id
3 2019-01-10 429492
4 2019-01-10 344343
5 2019-01-10 129292
If your 'date' column holds datetime objects, you just have to do (after from datetime import datetime):
df_new = df[df['date'] >= datetime(2019, 1, 10)]['customer_id']
If your 'date' column doesn't contain datetime objects, convert it first using the to_datetime method:
df['date'] = pd.to_datetime(df['date'])
And then apply the methodology described above.
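Another way to read "first appear" is to compute each customer's earliest date and keep those whose earliest date is on or after the cutoff. This is a sketch of that alternative, not one of the answers above, using the sample data from the question; `first_seen` and `new_customers` are names chosen here for illustration.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'date': ['2019-01-01'] * 3 + ['2019-01-10'] * 3,
    'customer_id': [429492, 344343, 949222, 429492, 344343, 129292],
})
df['date'] = pd.to_datetime(df['date'])
cutoff = pd.Timestamp('2019-01-10')

# Earliest date on which each customer appears.
first_seen = df.groupby('customer_id')['date'].min()

# Customers whose first appearance is on or after the cutoff, as a new frame.
new_customers = first_seen[first_seen >= cutoff].index.to_frame(index=False)
print(new_customers)
```

Unlike a plain date filter, this drops customers 429492 and 344343, who were already active on 2019-01-01.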
I have a pandas DataFrame with two date columns (A and B) and I would like to create a third column (C) holding dates built from the month and year of column A and the day of column B. Obviously I would need to adjust the day for months in which that day doesn't exist: if we try to create 31st Feb 2020, it should become 29th Feb 2020.
For example
import pandas as pd
df = pd.DataFrame({'A': ['2020-02-21', '2020-03-21', '2020-03-21'],
                   'B': ['2020-01-31', '2020-02-11', '2020-02-01']})
for c in df.columns:
    dfx[c] = pd.to_datetime(dfx[c])
Then I want to create a new column C that is a new datetime that is:
year = df.A.dt.year
month = df.A.dt.month
day = df.B.dt.day
I don't know how to create this column. Can you please help?
Here is one way to do it, using pandas' time series functionality:
import pandas as pd
# your example data
df = pd.DataFrame({'A': ['2020-02-21', '2020-03-21', '2020-03-21'],
                   'B': ['2020-01-31', '2020-02-11', '2020-02-01']})
for c in df.columns:
    # keep using the same dataframe here
    df[c] = pd.to_datetime(df[c])
# set back every date from A to the end of the previous month,
# then add the number of days from the date in B
df['C'] = df.A - pd.offsets.MonthEnd() + pd.to_timedelta(df.B.dt.day, unit='D')
print(df)
Result:
A B C
0 2020-02-21 2020-01-31 2020-03-02
1 2020-03-21 2020-02-11 2020-03-11
2 2020-03-21 2020-02-01 2020-03-01
As you can see in row 0, this handles the case of "February 31st" not quite as you suggested, but still in a logical way.
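If the day should instead be capped at the last day of A's month, as the question asks (31st Feb 2020 becoming 29th Feb 2020), one possible sketch is to clip B's day at `dt.days_in_month` of A and assemble the date from its parts. This is a different technique from the offset arithmetic above.

```python
import pandas as pd

# Example data from the question.
df = pd.DataFrame({'A': ['2020-02-21', '2020-03-21', '2020-03-21'],
                   'B': ['2020-01-31', '2020-02-11', '2020-02-01']})
df['A'] = pd.to_datetime(df['A'])
df['B'] = pd.to_datetime(df['B'])

# Day from B, capped at the number of days in A's month.
day = df['B'].dt.day.clip(upper=df['A'].dt.days_in_month)

# to_datetime can assemble dates from a frame with year/month/day columns.
df['C'] = pd.to_datetime(pd.DataFrame({'year': df['A'].dt.year,
                                       'month': df['A'].dt.month,
                                       'day': day}))
print(df)
```

Row 0 then becomes 2020-02-29 rather than spilling over into March; the other two rows are unchanged.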
The data set I am using contains only business days, but I want to change the date index so that it reflects every calendar day. I know I have to use reindex(), but I am unsure how to use its fill_value argument to inherit the value above.
import pandas as pd
idx = pd.date_range("12/18/2019","12/24/2019")
df = pd.Series({'12/18/2019':22.63,
                '12/19/2019':22.2,
                '12/20/2019':21.03,
                '12/23/2019':17,
                '12/24/2019':19.65})
df.index = pd.DatetimeIndex(df.index)
df = df.reindex()
Currently, my data set looks like this.
However, when I use reindex I get the below result
In reality I want it to inherit the values directly above if it is a NaN result so the data set becomes the following
Thank you guys for your help!
You were close! You just need to pass the index you want to reindex on (idx in this case) as a parameter to the reindex method, and then you can set the method parameter to 'ffill' to propagate the last valid value forward.
idx = pd.date_range("12/18/2019","12/24/2019")
df = pd.Series({'12/18/2019':22.63,
                '12/19/2019':22.2,
                '12/20/2019':21.03,
                '12/23/2019':17,
                '12/24/2019':19.65})
df.index = pd.DatetimeIndex(df.index)
df = df.reindex(idx, method='ffill')
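For reference, here is a self-contained sketch of the same idea with its imports, alongside the equivalent two-step form (reindex, then forward-fill). The names `s`, `filled` and `filled_alt` are chosen here for illustration.

```python
import pandas as pd

# Business-day data from the question.
s = pd.Series({'12/18/2019': 22.63,
               '12/19/2019': 22.2,
               '12/20/2019': 21.03,
               '12/23/2019': 17,
               '12/24/2019': 19.65})
s.index = pd.DatetimeIndex(s.index)

# Every calendar day in the range, including the weekend.
idx = pd.date_range('12/18/2019', '12/24/2019')

# One step: reindex and carry the last known value forward.
filled = s.reindex(idx, method='ffill')

# Two steps: reindex (weekend rows become NaN), then forward-fill.
filled_alt = s.reindex(idx).ffill()

print(filled)
```

The two forms agree here. They differ only when the original data already contains NaN: `method='ffill'` fills just the newly introduced dates, while `.ffill()` fills every NaN.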
It seems that you have created a Series, not a DataFrame. See if the code below helps you.
# assumes df was already reindexed on idx, so the missing days hold NaN
df = df.to_frame().reset_index()  # convert the Series to a DataFrame
df = df.ffill()
print(df)
Output (you will have to rename the columns):
index 0
0 2019-12-18 22.63
1 2019-12-19 22.20
2 2019-12-20 21.03
3 2019-12-21 21.03
4 2019-12-22 21.03
5 2019-12-23 17.00
6 2019-12-24 19.65
I'm trying to create a new date column based on an existing date column in my dataframe. I want to take all the dates in the first column and make them the first of the month in the second column so:
03/15/2019 = 03/01/2019
I know I can do this using:
df['newcolumn'] = pd.to_datetime(df['oldcolumn'], format='%Y-%m-%d').apply(lambda dt: dt.replace(day=1)).dt.date
My issue is that some of the data in the old column is not a valid date: some rows contain text. So I'm trying to figure out how to clean up the data before I do this, something like:
if oldcolumn isn't a date then make it 01/01/1990 else oldcolumn
Or is there a way to do this with try/except?
Any assistance would be appreciated.
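First we generate some sample data:
import datetime as dt
import pandas as pd
df = pd.DataFrame([['2019-01-03'], ['asdf'], ['2019-11-10']], columns=['Date'])
This can be safely converted to datetime; invalid entries become NaT and are then replaced with a placeholder date: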
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
mask = df['Date'].isnull()
df.loc[mask, 'Date'] = dt.datetime(1990, 1, 1)
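Now you don't need the slow apply. Note that adding pd.offsets.MonthBegin(-1) would move dates that already fall on the first of a month (such as the 1990-01-01 placeholder) back to the previous month, so convert to a monthly period and back instead:
df['New'] = df['Date'].dt.to_period('M').dt.to_timestamp()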
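A self-contained sketch of the whole clean-up, assuming the column names `oldcolumn` and `newcolumn` from the question and a hypothetical three-row sample. It uses a round trip through a monthly period so that dates already on the first of a month stay where they are.

```python
import pandas as pd

# Hypothetical sample: a valid date, a text entry, and a date already on the 1st.
df = pd.DataFrame({'oldcolumn': ['2019-03-15', 'asdf', '2019-11-01']})

# Invalid entries become NaT instead of raising.
parsed = pd.to_datetime(df['oldcolumn'], format='%Y-%m-%d', errors='coerce')

# Replace NaT with the placeholder date requested in the question.
parsed = parsed.fillna(pd.Timestamp('1990-01-01'))

# Truncate to the first of the month without apply.
df['newcolumn'] = parsed.dt.to_period('M').dt.to_timestamp()
print(df)
```

If plain `date` objects are needed, as in the original one-liner, `.dt.date` can be appended to the last step.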
Try with the argument errors='coerce'.
This will return NaT for the text values.
df['newcolumn'] = pd.to_datetime(df['oldcolumn'],
                                 format='%Y-%m-%d',
                                 errors='coerce').apply(lambda dt: dt.replace(day=1)).dt.date
For example
# We have this dataframe
ID Date
0 111 03/15/2019
1 133 01/01/2019
2 948 Empty
3 452 02/10/2019
# We convert Date column to datetime
df['Date'] = pd.to_datetime(df.Date, format='%m/%d/%Y', errors='coerce')
Output
ID Date
0 111 2019-03-15
1 133 2019-01-01
2 948 NaT
3 452 2019-02-10
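Before overwriting anything, it can help to see which raw values failed to parse. The sketch below rebuilds the example table above and lists the offending entries; `raw` and `invalid` are names chosen here for illustration.

```python
import pandas as pd

# The example table from above.
df = pd.DataFrame({'ID': [111, 133, 948, 452],
                   'Date': ['03/15/2019', '01/01/2019', 'Empty', '02/10/2019']})

# Keep the raw strings so the failures can be inspected afterwards.
raw = df['Date'].copy()
df['Date'] = pd.to_datetime(raw, format='%m/%d/%Y', errors='coerce')

# Raw values that could not be parsed with the given format.
invalid = raw[df['Date'].isna()].tolist()
print(invalid)
```

Note that this also flags values that were genuinely missing in the raw data, not only text that failed to parse.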