Join in Pandas dataframe (Auto conversion from date to date-time?) - python

I tried to use the join function to combine the close prices of all 500 stocks over a 5-year period (2013-02-08 to 2018-02-07), where each column represents a stock and the index of the dataframe is the dates.
But the join function in pandas seems to automatically change the date format of the index, so that every entry in the combined dataframe is NaN.
The code to import and preview the file:
import pandas as pd
df= pd.read_csv('all_stocks_5yr.csv')
df.head()
(https://i.stack.imgur.com/29Wq4.png)
# df.info()
df['Name'].unique().shape #There are 505 stock names in total
dates = pd.date_range(df['date'].min(), df['date'].max()) #the date range, used as the index below
Single out the close prices:
close_prices = pd.DataFrame(index=dates) #Make the index column to be the dates
# close_prices.head()
symbols = df['Name'].unique() #Denote the stock names as an array
So I tried to test the result for each stock using the first 3 stocks:
i = 1
for symbol in symbols:
    df_sym = df[df['Name']==symbol]
    df_tmp = pd.DataFrame(data=df_sym['close'].to_numpy() , index = df_sym['date'], columns=[symbol])
    print(df_tmp) #print the temporary dataframes
    i += 1
    if i >3: break
And the results were as expected, a dataframe indexed by date and holding only one stock:
AAL
date
2013-02-08 14.75
2013-02-11 14.46
2013-02-12 14.27
2013-02-13 14.66
2013-02-14 13.99
... ...
2018-02-01 53.88
2018-02-02 52.10
2018-02-05 49.76
2018-02-06 51.18
2018-02-07 51.40
[1259 rows x 1 columns]
AAPL
date
2013-02-08 67.8542
2013-02-11 68.5614
2013-02-12 66.8428
2013-02-13 66.7156
2013-02-14 66.6556
... ...
2018-02-01 167.7800
2018-02-02 160.5000
2018-02-05 156.4900
2018-02-06 163.0300
2018-02-07 159.5400
[1259 rows x 1 columns]
...
Now here's the part I find very confusing: I checked what happens when combining the first 3 stock dataframes using the join function, with index 'date':
i = 1
for symbol in symbols:
    df_sym = df[df['Name']==symbol]
    df_tmp = pd.DataFrame(data=df_sym['close'].to_numpy() , index = df_sym['date'], columns=[symbol])
    close_prices = close_prices.join(df_tmp)
    i += 1
    if i >3: break
close_prices.head()
(https://i.stack.imgur.com/MqVDo.png)
Somehow the index changed from date to date-time format, and therefore naturally "join" function will recognize that none of the entries matches with the new index, and automatically put a NA there for every single entry.
What caused the date to have changed to date-time?

You can use pivot:
df = pd.read_csv('all_stocks_5yr.csv')
out = df.pivot(index='date', columns='Name', values='close')
Output:
>>> out.iloc[:, :8]
Name A AAL AAP AAPL ABBV ABC ABT ACN
date
2013-02-08 45.08 14.75 78.90 67.8542 36.25 46.89 34.41 73.31
2013-02-11 44.60 14.46 78.39 68.5614 35.85 46.76 34.26 73.07
2013-02-12 44.62 14.27 78.60 66.8428 35.42 46.96 34.30 73.37
2013-02-13 44.75 14.66 78.97 66.7156 35.27 46.64 34.46 73.56
2013-02-14 44.58 13.99 78.84 66.6556 36.57 46.77 34.70 73.13
... ... ... ... ... ... ... ... ...
2018-02-01 72.83 53.88 117.29 167.7800 116.34 99.29 62.18 160.46
2018-02-02 71.25 52.10 113.93 160.5000 115.17 96.02 61.69 156.90
2018-02-05 68.22 49.76 109.86 156.4900 109.51 91.90 58.73 151.83
2018-02-06 68.45 51.18 112.20 163.0300 111.20 91.54 58.86 154.69
2018-02-07 68.06 51.40 109.93 159.5400 113.62 94.22 58.67 155.15
[1259 rows x 8 columns]
Source 'all_stocks_5yr.csv': Kaggle
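As for why the join produced only NaN: a likely explanation, assuming the 'date' column was read from the CSV as plain strings (read_csv does not parse dates unless asked to, for example with parse_dates=['date']), is that close_prices is indexed by the DatetimeIndex returned by pd.date_range, while each df_tmp is indexed by strings. The labels never match, so the left join keeps the datetime index and fills every cell with NaN. A minimal sketch of the loop with the column converted first, using a few made-up rows in place of the real file:

```python
import pandas as pd

# Made-up stand-in for a few rows of all_stocks_5yr.csv; 'date' holds strings,
# as it would after pd.read_csv without parse_dates.
df = pd.DataFrame({
    "date": ["2013-02-08", "2013-02-11", "2013-02-08", "2013-02-11"],
    "Name": ["AAL", "AAL", "AAPL", "AAPL"],
    "close": [14.75, 14.46, 67.8542, 68.5614],
})

# Convert once so the column and the date_range index hold the same kind of label.
df["date"] = pd.to_datetime(df["date"])

dates = pd.date_range(df["date"].min(), df["date"].max())
close_prices = pd.DataFrame(index=dates)

for symbol in df["Name"].unique():
    df_sym = df[df["Name"] == symbol]
    df_tmp = pd.DataFrame({symbol: df_sym["close"].to_numpy()},
                          index=df_sym["date"])
    # Left join on the index: rows now line up because both sides are datetimes.
    close_prices = close_prices.join(df_tmp)

print(close_prices)
```

Because pd.date_range produces every calendar day, weekends and holidays still appear as NaN rows; those are gaps in the trading calendar, not join failures.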

Related

Pandas: compute average and standard deviation by clock time

I have a DataFrame like this:
date time value
0 2019-04-18 07:00:10 100.8
1 2019-04-18 07:00:20 95.6
2 2019-04-18 07:00:30 87.6
3 2019-04-18 07:00:40 94.2
The DataFrame contains a value recorded every 10 seconds for the entire year 2019. I need to calculate the standard deviation and mean of value for each hour of each date, and create two new columns for them. I first tried separating out the hour for each value like this:
df["hour"] = df["time"].astype(str).str[:2]
Then I have tried to calculate standard deviation by:
df["std"] = df.groupby("hour").median().index.get_level_values('value').stack().std()
But that doesn't work. Could I have some advice on the problem?
We can split the time column around the delimiter :, then slice the hour component using str[0], finally group the dataframe on date along with hour component and aggregate column value with mean and std:
hr = df['time'].str.split(':', n=1).str[0]
df.groupby(['date', hr])['value'].agg(['mean', 'std'])
If you want to broadcast the aggregated values to original dataframe, then we need to use transform instead of agg:
g = df.groupby(['date', df['time'].str.split(':', n=1).str[0]])['value']
df['mean'], df['std'] = g.transform('mean'), g.transform('std')
date time value mean std
0 2019-04-18 07:00:10 100.8 94.55 5.434151
1 2019-04-18 07:00:20 95.6 94.55 5.434151
2 2019-04-18 07:00:30 87.6 94.55 5.434151
3 2019-04-18 07:00:40 94.2 94.55 5.434151
I have synthesized the data. The approach: start by generating a true datetime column, groupby() the hour, use describe() to get the mean and std, then merge() back to the original data frame.
import numpy as np
import pandas as pd
d = pd.date_range("1-Jan-2019", "28-Feb-2019", freq="10s")
df = pd.DataFrame({"datetime":d, "value":np.random.uniform(70,90,len(d))})
df = df.assign(date=df.datetime.dt.strftime("%Y-%m-%d"),
               time=df.datetime.dt.strftime("%H:%M:%S"))
# create a datetime column - better than manipulating strings
df["datetime"] = pd.to_datetime(df.date + " " + df.time)
# calc mean & std by hour
dfh = (df.groupby(df.datetime.dt.hour, as_index=False)
       .apply(lambda dfa: dfa.describe().T.loc[:,["mean","std"]].reset_index(drop=True))
       .droplevel(1)
)
# merge mean & std by hour back
df.merge(dfh, left_on=df.datetime.dt.hour, right_index=True).drop(columns="key_0")
datetime value mean std
0 2019-01-01 00:00:00 86.014209 80.043364 5.777724
1 2019-01-01 00:00:10 77.241141 80.043364 5.777724
2 2019-01-01 00:00:20 71.650739 80.043364 5.777724
3 2019-01-01 00:00:30 71.066332 80.043364 5.777724
4 2019-01-01 00:00:40 77.203291 80.043364 5.777724
... ... ... ... ...
3144955 2019-12-30 23:59:10 89.577237 80.009751 5.773007
3144956 2019-12-30 23:59:20 82.154883 80.009751 5.773007
3144957 2019-12-30 23:59:30 82.131952 80.009751 5.773007
3144958 2019-12-30 23:59:40 85.346724 80.009751 5.773007
3144959 2019-12-30 23:59:50 78.122761 80.009751 5.773007
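Note that grouping on dt.hour alone pools the same clock hour across all days. If the statistics are needed for each hour of each date, as the question asks, one option is to group on the timestamp floored to the hour and broadcast the results with transform. A minimal sketch, assuming date and time are string columns as in the question (the two 08:00 rows are made up so that a second group exists):

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2019-04-18"] * 6,
    "time": ["07:00:10", "07:00:20", "07:00:30", "07:00:40",
             "08:00:10", "08:00:20"],
    "value": [100.8, 95.6, 87.6, 94.2, 90.0, 92.0],
})

# Build real timestamps, then floor them so every row carries its (date, hour) bucket.
ts = pd.to_datetime(df["date"] + " " + df["time"])
g = df.groupby(ts.dt.floor("h"))["value"]

# transform returns one value per original row, so it can be assigned directly.
df["mean"] = g.transform("mean")
df["std"] = g.transform("std")
print(df)
```

The 07:00 rows reproduce the mean of 94.55 and std of about 5.434 shown in the first answer.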

Count values greater than threshold and assign to appropriate year pandas

I have a dataframe that looks like this:
Date DFW
242 2000-05-01 00:00:00 75.92
243 2000-05-01 12:00:00 75.02
244 2000-05-02 00:00:00 71.96
245 2000-05-02 12:00:00 75.92
246 2000-05-03 00:00:00 71.96
... ... ...
14991 2020-07-09 12:00:00 93.90
14992 2020-07-10 00:00:00 91.00
14993 2020-07-10 12:00:00 93.00
14994 2020-07-11 00:00:00 89.10
14995 2020-07-11 12:00:00 97.00
The df contains the max value of temperature for a specific location every 12 hours from May - July 11 during 2000-2020. I want to count the number of times that the value is >90 and then store that value in a column where the row is the year. Should I use groupby to accomplish this?
Expected output:
Year count
2000 x
2001 y
... ...
2019 z
2020 a
You can do this with groupby:
# extract the years from dates
years = df['Date'].dt.year
# compare `DFW` with `90`
# gt90 will be just True or False
gt90 = df['DFW'].gt(90)
# sum the `True` by years
output = gt90.groupby(years).sum()
# set the years as normal column:
output = output.reset_index()
All that in one line:
df['DFW'].gt(90).groupby(df['Date'].dt.year).sum().reset_index()
One possible approach is to extract and create a new column for year (let's say "year") and then,
df[df['DFW'] > 90].groupby('year').count().reset_index()
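The two approaches differ in one respect: filtering first drops any year that has no reading above 90, while summing a boolean mask keeps such years with a count of 0. A small sketch of the mask version that also produces the Year and count column names from the expected output (the sample rows are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2000-05-01 00:00:00", "2000-05-01 12:00:00",
        "2001-05-01 00:00:00", "2001-05-01 12:00:00",
        "2002-05-01 00:00:00",
    ]),
    "DFW": [91.0, 75.02, 93.9, 97.0, 80.0],
})

counts = (
    df["DFW"].gt(90)                 # True where the reading exceeds 90
    .groupby(df["Date"].dt.year)     # bucket the flags by calendar year
    .sum()                           # True counts as 1, False as 0
    .rename_axis("Year")
    .reset_index(name="count")
)
print(counts)
```

Here 2002 has no reading above 90 and still appears with a count of 0.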

How to count total days in pandas dataframe

I have a df column with dates and hours / minutes:
0 2019-09-13 06:00:00
1 2019-09-13 06:05:00
2 2019-09-13 06:10:00
3 2019-09-13 06:15:00
4 2019-09-13 06:20:00
Name: Date, dtype: datetime64[ns]
I need to count how many days the dataframe contains.
I tried it like this:
sample_length = len(df.groupby(df['Date'].dt.date).first())
and
sample_length = len(df.groupby(df['Date'].dt.date))
But the number I get seems wrong. Do you know another method of counting the days?
Consider the sample dates:
sample = pd.date_range('2019-09-12 06:00:00', periods=50, freq='4h')
df = pd.DataFrame({'date': sample})
date
0 2019-09-12 06:00:00
1 2019-09-12 10:00:00
2 2019-09-12 14:00:00
3 2019-09-12 18:00:00
4 2019-09-12 22:00:00
5 2019-09-13 02:00:00
6 2019-09-13 06:00:00
...
47 2019-09-20 02:00:00
48 2019-09-20 06:00:00
49 2019-09-20 10:00:00
Use DataFrame.groupby to group the dataframe on df['date'].dt.date and use the aggregate function GroupBy.size; the number of groups, len(count), is then the number of distinct days:
count = df.groupby(df['date'].dt.date).size()
# print(count)
date
2019-09-12 5
2019-09-13 6
2019-09-14 6
2019-09-15 6
2019-09-16 6
2019-09-17 6
2019-09-18 6
2019-09-19 6
2019-09-20 3
dtype: int64
I'm not completely sure what you want to do here. Do you want to count the number of unique days (Monday/Tuesday/...), monthly dates (1-31 ish), yearly dates (1-365), or unique dates (unique days since the dawn of time)?
From a pandas series, you can use {series}.value_counts() to get the number of entries for each unique value, or simply get all unique values with {series}.unique()
import pandas as pd
df = pd.DataFrame(pd.DatetimeIndex(['2016-10-08 07:34:13', '2015-11-15 06:12:48',
'2015-01-24 10:11:04', '2015-03-26 16:23:53',
'2017-04-01 00:38:21', '2015-03-19 03:47:54',
'2015-12-30 07:32:32', '2015-11-10 20:39:36',
'2015-06-24 05:48:09', '2015-03-19 16:05:19'],
dtype='datetime64[ns]', freq=None), columns = ["date"])
days (Monday/Tuesday/...):
df.date.dt.dayofweek.value_counts()
monthly dates (1-31 ish)
df.date.dt.day.value_counts()
yearly dates (1-365)
df.date.dt.dayofyear.value_counts()
unique dates (unique days since the dawn of time)
df.date.dt.date.value_counts()
To get the number of unique entries from any of the above, simply add .shape[0]
In order to calculate the total number of unique dates in the given time series data example we can use:
print(len(pd.to_datetime(df['Date']).dt.date.unique()))
import pandas as pd
df = pd.DataFrame({'Date': ['2019-09-13 06:00:00',
'2019-09-13 06:05:00',
'2019-09-13 06:10:00',
'2019-09-13 06:15:00',
'2019-09-13 06:20:00']
},
dtype = 'datetime64[ns]'
)
df = df.set_index('Date')
_count_of_days = df.resample('D').first().shape[0]
print(_count_of_days)
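The answers above can give different numbers for the same data: resample('D') creates a row for every calendar day between the first and last timestamp, including days with no records, whereas counting unique dates only counts days that actually appear. A short sketch of both counts, assuming a datetime64 column named Date as in the question (the sample values are made up and deliberately skip a day):

```python
import pandas as pd

s = pd.Series(pd.to_datetime([
    "2019-09-13 06:00:00",
    "2019-09-13 06:05:00",
    "2019-09-15 06:10:00",   # 2019-09-14 has no records
]), name="Date")

# Days that have at least one record: drop the time part, count distinct values.
days_with_data = s.dt.normalize().nunique()

# Calendar days spanned from the first to the last record, inclusive.
calendar_days = (s.max().normalize() - s.min().normalize()).days + 1

print(days_with_data, calendar_days)
```

Which of the two is "the number of days" depends on whether days without any records should count.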

Merging data frames based on value in row and column name

I work with financial data and try to merge two pandas data frames.
In the first data frame I have the information of company name, ticker code, and date.
Date Ticker Company
0 2020-01-15 CHR.CO Chr. Hansen
1 2020-01-15 PNDORA.CO Pandora A/S
In my second df, I have a date and closing prices for stocks on some given dates.
Date CHR.CO COLO-B.CO DANSKE.CO PNDORA.CO VWS.CO
0 2020-01-15 89.5 89.5 187.39 54.4 552.0
1 2020-01-16 90 88.0 184.61 55.2 550.0
How could I merge these two data frames so I could get the closing stock price in the first dataframe?
Here's the desired output:
Date Ticker Company Close_price
0 2020-01-15 CHR.CO Chr. Hansen 89.5
1 2020-01-15 PNDORA.CO Pandora A/S 54.4
Using the below line I merge the two dataframes on date, but also get all the tickers and the close price for all companies.
full = new_df.merge(stocks_close, on = "Date")
Add DataFrame.melt before merge and also specify both columns ["Date",'Ticker'] in parameter on:
df = stocks_close.melt(id_vars='Date', var_name='Ticker', value_name='Close_price')
full = new_df.merge(df, on = ["Date",'Ticker'])
print (full)
Date Ticker Company Close_price
0 2020-01-15 CHR.CO Chr. Hansen 89.5
1 2020-01-15 PNDORA.CO Pandora A/S 54.4

Pandas Subset of a Time Series Without Resampling

I have a pandas data series with cumulative daily returns for a series:
Date CumReturn
3/31/2017 1
4/3/2017 .99
4/4/2017 .992
... ...
4/28/2017 1.012
5/1/2017 1.011
... ...
5/31/2017 1.022
... ...
6/30/2017 1.033
... ...
I want only the month-end values.
Date CumReturn
4/28/2017 1.012
5/31/2017 1.022
6/30/2017 1.033
Because I want only the month-end values, resampling doesn't work as it aggregates the interim values.
What is the easiest way to get only the month end values as they appear in the original dataframe?
Use the is_month_end component of the .dt date accessor:
# Ensure the date column is a Timestamp
df['Date'] = pd.to_datetime(df['Date'])
# Filter to end of the month only
df = df[df['Date'].dt.is_month_end]
Applying this to the data you provided:
Date CumReturn
0 2017-03-31 1.000
5 2017-05-31 1.022
6 2017-06-30 1.033
EDIT
To get business month end, compare using BMonthEnd(0):
from pandas.tseries.offsets import BMonthEnd
# Ensure the date column is a Timestamp
df['Date'] = pd.to_datetime(df['Date'])
# Filter to business month end only
df = df[df['Date'] == df['Date'] + BMonthEnd(0)]
Applying this to the data you provided:
Date CumReturn
0 2017-03-31 1.000
3 2017-04-28 1.012
5 2017-05-31 1.022
6 2017-06-30 1.033
df.sort_values('Date').groupby([df.Date.dt.year,df.Date.dt.month]).last()
Out[197]:
Date CumReturn
Date Date
2017 3 2017-03-31 1.000
4 2017-04-28 1.012
5 2017-05-31 1.022
6 2017-06-30 1.033
Assuming that the dataframe is already sorted by 'Date' and that the values in that column are Pandas timestamps, you can convert them to YYYY-mm string values for grouping and take the last value:
df.groupby(df['Date'].dt.strftime('%Y-%m'))['CumReturn'].last()
# Example output:
# 2017-01 0.127002
# 2017-02 0.046894
# 2017-03 0.005560
# 2017-04 0.150368
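If the goal is to keep the original rows (including their index and the actual Date values) for the last available date of each month, whether or not that date is a calendar or business month end, one possible variant is to group on the monthly period and take the last row of each group with tail(1). A minimal sketch, using the rows visible in the question and assuming the dates are month/day/year strings:

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["3/31/2017", "4/3/2017", "4/4/2017", "4/28/2017",
             "5/1/2017", "5/31/2017", "6/30/2017"],
    "CumReturn": [1, 0.99, 0.992, 1.012, 1.011, 1.022, 1.033],
})

df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")
df = df.sort_values("Date")

# tail(1) keeps the last row of each month as it appears in the frame,
# without aggregating or relabelling anything.
month_end = df.groupby(df["Date"].dt.to_period("M")).tail(1)
print(month_end)
```

Unlike last(), tail(1) returns a filtered view of the original frame rather than a new frame indexed by the group keys.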
