Getting a specific element from a Pandas Dataframe - python

This is probably a rookie question, but I haven't been able to find the solution.
I'm trying to collect some data from Yahoo Finance using pandas.
from pandas_datareader import data
tickers = ['EQNR.OL','BP','CL=F']
start_date = '2001-01-02'
end_date = '2021-02-26'
panel_data = data.DataReader(tickers, 'yahoo', start_date, end_date)
I wanna take a look at the BP stock (I need all 3, so excluding EQNR.OL and CL=F from tickers is not the right solution). I know how to get all the close prices of a single stock:
close_BP = panel_data['Close','BP']
But is there a way I can get all BP data (open, close, high, low) withdrawn from 'panel_data', and not only a specific column like 'close'?
I was thinking something like BP = panel_data[:,'BP'] or BP = panel_data.loc[:,'BP'] but it doesn't work.
A big thanks in advance.

I think what you're looking for is pandas.IndexSlice...
import pandas as pd
panel_data.loc[:, pd.IndexSlice[:, 'BP']]

You can swap multi-columns and then extract them with loc.
panel_data = panel_data.swaplevel(axis=1)
panel_data.loc[:, ('BP',)]
Attributes Adj Close Close High Low Open Volume
Date
2021-01-04 20.551737 20.830000 21.129999 20.549999 21.090000 14485100.0
2021-01-05 22.081030 22.379999 22.780001 21.370001 21.430000 25447500.0
2021-01-06 23.097271 23.410000 23.860001 22.940001 23.370001 25221400.0
2021-01-07 23.590591 23.910000 24.150000 23.500000 23.719999 16470700.0
2021-01-08 24.074045 24.400000 24.490000 23.990000 24.170000 20189600.0
...

Related

Get records for the nearest date if record does not exist for a particular date

I have a pandas dataframe of stock records, my goal is to pass in a particular 'day' e.g 8 and get the filtered data frame for the 8th of each month and year in the dataset.
I have gone through some SO questions and managed to get one part of my requirement that was getting the records for a particular day, however if the data for say '8th' does not exist for the particular month and year, I need to get the records for the closest day where record exists for this particular month and year.
As an example, if I pass in 8th and there is no record for 8th Jan' 2022, I need to see if records exists for 7th and 9th Jan'22, and so on..and get the record for the nearest date.
If record is present in both 7th and 9th, I will get the record for 9th (higher date).
However, it is possible if the record for 7th exists and 9th does not exist, then I will get the record for 7th (closest).
Code I have written so far
filtered_df = data.loc[(data['Date'].dt.day == 8)]
If the dataset is required, please let me know. I tried to make it clear but if there is any doubt, please let me know. Any help in the correct direction is appreciated.
Alternative 1
Resample to a daily resolution, selecting the nearest day to fill in missing values:
df2 = df.resample('D').nearest()
df2 = df2.loc[df2.index.day == 8]
Alternative 2
A more general method (and a tiny bit faster) is to generate dates/times of your choice, then use reindex() and method 'nearest'. It is more general because you can use any series of timestamps you could come up with (not necessarily aligned with any frequency).
dates = pd.date_range(
start=df.first_valid_index().normalize(), end=df.last_valid_index(),
freq='D')
dates = dates[dates.day == 8]
df2 = df.reindex(dates, method='nearest')
Example
Let's start with a reproducible example:
import yfinance as yf
df = yf.download(['AAPL', 'AMZN'], start='2022-01-01', end='2022-12-31', freq='D')
>>> df.iloc[:10, :5]
Adj Close Close High
AAPL AMZN AAPL AMZN AAPL
Date
2022-01-03 180.959747 170.404495 182.009995 170.404495 182.880005
2022-01-04 178.663086 167.522003 179.699997 167.522003 182.940002
2022-01-05 173.910645 164.356995 174.919998 164.356995 180.169998
2022-01-06 171.007523 163.253998 172.000000 163.253998 175.300003
2022-01-07 171.176529 162.554001 172.169998 162.554001 174.139999
2022-01-10 171.196426 161.485992 172.190002 161.485992 172.500000
2022-01-11 174.069748 165.362000 175.080002 165.362000 175.179993
2022-01-12 174.517136 165.207001 175.529999 165.207001 177.179993
2022-01-13 171.196426 161.214005 172.190002 161.214005 176.619995
2022-01-14 172.071335 162.138000 173.070007 162.138000 173.779999
Now:
df2 = df.resample('D').nearest()
df2 = df2.loc[df2.index.day == 8]
>>> df2.iloc[:5, :5]
Adj Close Close High
AAPL AMZN AAPL AMZN AAPL
2022-01-08 171.176529 162.554001 172.169998 162.554001 174.139999
2022-02-08 174.042633 161.413498 174.830002 161.413498 175.350006
2022-03-08 156.730942 136.014496 157.440002 136.014496 162.880005
2022-04-08 169.323975 154.460495 170.089996 154.460495 171.779999
2022-05-08 151.597595 108.789001 152.059998 108.789001 155.830002
Warning
Replacing a missing day with data from the future (which is what happens when the nearest day is after the missing one) is called peak-ahead and can cause peak-ahead bias in quant research that would use that data. It is usually considered dangerous. You'd be safer using method='ffill'.

How do I continiously calculate something based on the past X amount of data? (Please see info for more details)

Goal:
Calculate 50day moving average for each day, based on the past 50 days. I can calculate the mean for the entire dataset, but I am trying to contiously calculate the mean based on the past 50 days...with it changing each day of course!
import numpy as np
import pandas_datareader.data as pdr
import pandas as pd
# Define the instruments to download. We would like to see Apple, Microsoft and the S&P500 index.
ticker = ['AAPL']
#Define the data period that you would like
start_date = '2017-07-01'
end_date = '2019-02-08'
# User pandas_reader.data.DataReader to load the stock prices from Yahoo Finance.
df = pdr.DataReader(ticker, 'yahoo', start_date, end_date)
# Yahoo Finance gives 'High', 'Low', 'Open', 'Close', 'Volume', 'Adj Close'.
#Export Close PRice, Volume, and Date from yahoo finance
CloseP = df['Close']
CloseP.head()
Volm = df['Volume']
Volm.head()
Date = df["Date"] = df.index
#create a table with Date, Close Price, and Volume
Table = pd.DataFrame(np.array(Date), columns = ['Date'])
Table['Close Price'] = np.array(CloseP)
Table['Volume'] = np.array(Volm)
print (Table)
#create a column that contiosuly calculates 50 day MA
#This is what I can't get to work!
MA = np.mean(df['Close'])
Table['Moving Average'] = np.array(MA)
print (Table)
First of all, please, don't use CamelCase to name your variables, as they look as class names otherwise.
Next, use merge() to join your data frames instead of those yours np.array way:
>>> table = CloseP.merge(Volm, left_index=True, right_index=True)
>>> table.columns = ['close', 'volume'] # give names to columns
>>> table.head(10)
close volume
Date
2017-07-03 143.500000 14277800.0
2017-07-05 144.089996 21569600.0
2017-07-06 142.729996 24128800.0
2017-07-07 144.179993 19201700.0
2017-07-10 145.059998 21090600.0
2017-07-11 145.529999 19781800.0
2017-07-12 145.740005 24884500.0
2017-07-13 147.770004 25199400.0
2017-07-14 149.039993 20132100.0
2017-07-17 149.559998 23793500.0
Finally, use combination of rolling(), mean() and dropna() to calculate moving average:
>>> ma50 = table.rolling(window=50).mean().dropna()
>>> ma50.head(10)
close volume
Date
2017-09-12 155.075401 26092540.0
2017-09-13 155.398401 26705132.0
2017-09-14 155.682201 26748954.0
2017-09-15 156.025201 27248670.0
2017-09-18 156.315001 27430024.0
2017-09-19 156.588401 27424424.0
2017-09-20 156.799201 28087816.0
2017-09-21 156.952201 28340360.0
2017-09-22 157.034601 28769280.0
2017-09-25 157.064801 29254384.0
Please, refer to the docs of mentioned API calls to get more info about their usage. Good luck!

Pandas DataReader handle RemoteError yahoo finance

I am using the below code to parse a large tickers list to yahoo datareader, I am trying to get back a dataframe as per below. If the list is large, I often get a RemoteError back but on different tickers each time. I am not sure how to handle the RemoteError and I am happy to drop the ticker and continue with the next ticker in the list. I would, however, like to try again to get adj close ticker data. I thought using a for loop and adding a time delay would help with yahoo requests but I am still getting a Remote error. Any ideas?
IBM MSFT ORCL TSLA YELP
Date
2014-01-02 184.52 36.88 37.61 150.10 67.92
2014-01-03 185.62 36.64 37.51 149.56 67.66
2014-01-06 184.99 35.86 37.36 147.00 71.72
2014-01-07 188.68 36.14 37.74 149.36 72.66
2014-01-08 186.95 35.49 37.61 151.28 78.42
import pandas_datareader.data as web
import datetime as dt
import pandas as pd
import time
from pandas_datareader._utils import RemoteDataError
Which_group = ['Accident & Health Insurance'] ##<<<<put in group here
df = pd.read_csv('/home/ross/Downloads/UdemyPairs/stocks1.csv')
df.set_index('categoryName', inplace = True)
df1 = df.loc[Which_group]
tickers = df1.Ticker.tolist()
print(tickers)
#tickers = ['SPY', 'AAPL', 'MSFT'] # add as many tickers
start = dt.datetime(2013, 1,1)
end = dt.datetime.today()
# Function starts here
def get_previous_close(strt, end, tick_list, this_price):
""" arg: `this_price` can take str Open, High, Low, Close, Volume"""
#make an empty dataframe in which we will append columns
adj_close = pd.DataFrame([])
# loop here.
for idx, i in enumerate(tick_list):
try:
# time.sleep(0.01)
total = web.DataReader(i, 'yahoo', strt, end)
adj_close[i] = total[this_price]
except RemoteDataError:
pass
return adj_close
#call the function
print(get_previous_close(start, end, tickers, 'Adj Close'))
Maybe you can look at this question. This proposes a solution that it might work for you.
Pandas Dataframe - RemoteDataError - Python

Stock price Import issue for a newbie

a total newbie who started this week on python. I have been reading Datacamp and some other online resources as well as Python without fear.
I wanted to test and see if I can import some data prices and copied code from the internet. I cannot get it to work due to an error: TypeError: string indices must be integers on line 10
import pandas_datareader as pdr #needed to read data from yahoo
#df = pdr.get_data_yahoo('AAPL')
#print (df.Close)
stock =('AAPL')
start_date = '2017-01-01'
end_date = '2017-12-10'
closes = [c['Close'] for c in pdr.get_data_yahoo(stock, start_date,
end_date)]
for c in closes:
print (c)
The line closes = [c.......] is giving me an error.
Any advice on how to fix this? I am starting my journey and actually trying to import the close prices for past year for S&P500 and then save them to Excel. If there is a snippet which does this already and I can learn from, please let me know.
Thank you all.
The call to get_data_yahoo returns a single dataframe.
df = pdr.get_data_yahoo(stock, start_date, end_date)
df.head()
Open High Low Close Adj Close \
Date
2017-01-03 115.800003 116.330002 114.760002 116.150002 114.311760
2017-01-04 115.849998 116.510002 115.750000 116.019997 114.183815
2017-01-05 115.919998 116.860001 115.809998 116.610001 114.764473
2017-01-06 116.779999 118.160004 116.470001 117.910004 116.043915
2017-01-09 117.949997 119.430000 117.940002 118.989998 117.106812
Volume
Date
2017-01-03 28781900
2017-01-04 21118100
2017-01-05 22193600
2017-01-06 31751900
2017-01-09 33561900
type(df)
pandas.core.frame.DataFrame
Meanwhile, you're trying to iterate over this returned dataframe. By default, a for loop will iterate over the columns. For example:
for c in df:
print(c)
Open
High
Low
Close
Adj Close
Volume
When you replicate this code in a list comp, c is given each column name in turn, and str[str] is an invalid operation.
In summary, just doing closes = df['Closes'] on the returned result is sufficient to obtain the Closes column.
I think you are simply looking to dump the dataframe to an Excel spreadsheet. This will get you there.
import pandas as pd
import pandas_datareader as pdr
df = pdr.get_data_yahoo('AAPL')
df.head(2)
Out[12]:
Open High Low Close Adj Close Volume
Date
2009-12-31 30.447144 30.478571 30.08 30.104286 26.986492 88102700
2010-01-04 30.490000 30.642857 30.34 30.572857 27.406532 123432400
df.to_excel('dump_aapl.xlsx')
If you just want the Close column:
df['Close'].to_excel('dump_aapl.xlsx')

Convert daily pandas stock data to monthly data using first trade day of the month

I have a set of calculated OHLCVA daily securities data in a pandas dataframe like this:
>>> type(data_dy)
<class 'pandas.core.frame.DataFrame'>
>>> data_dy
Open High Low Close Volume Adj Close
Date
2012-12-28 140.64 141.42 139.87 140.03 148806700 134.63
2012-12-31 139.66 142.56 139.54 142.41 243935200 136.92
2013-01-02 145.11 146.15 144.73 146.06 192059000 140.43
2013-01-03 145.99 146.37 145.34 145.73 144761800 140.11
2013-01-04 145.97 146.61 145.67 146.37 116817700 140.72
[5 rows x 6 columns]
I'm using the following dictionary and the pandas resample function to convert the dataframe to monthly data:
>>> ohlc_dict = {'Open':'first','High':'max','Low':'min','Close': 'last','Volume': 'sum','Adj Close': 'last'}
>>> data_dy.resample('M', how=ohlc_dict, closed='right', label='right')
Volume Adj Close High Low Close Open
Date
2012-12-31 392741900 136.92 142.56 139.54 142.41 140.64
2013-01-31 453638500 140.72 146.61 144.73 146.37 145.11
[2 rows x 6 columns]
This does the calculations correctly, but I'd like to use the Yahoo! date convention for monthly data of using the first trading day of the period rather than the last calendar day of the period that pandas uses.
So I'd like the answer set to be:
Volume Adj Close High Low Close Open
Date
2012-12-28 392741900 136.92 142.56 139.54 142.41 140.64
2013-01-02 453638500 140.72 146.61 144.73 146.37 145.11
I could do this by converting the daily data to a python list, process the data and return the data to a dataframe, but how do can this be done with pandas?
Instead of M you can pass MS as the resample rule:
df =pd.DataFrame( range(72), index = pd.date_range('1/1/2011', periods=72, freq='D'))
#df.resample('MS', how = 'mean') # pandas <0.18
df.resample('MS').mean() # pandas >= 0.18
Updated to use the first business day of the month respecting US Federal Holidays:
df =pd.DataFrame( range(200), index = pd.date_range('12/1/2012', periods=200, freq='D'))
from pandas.tseries.offsets import CustomBusinessMonthBegin
from pandas.tseries.holiday import USFederalHolidayCalendar
bmth_us = CustomBusinessMonthBegin(calendar=USFederalHolidayCalendar())
df.resample(bmth_us).mean()
if you want custom starts of the month using the min month found in the data try this. (It isn't pretty, but it should work).
month_index =df.index.to_period('M')
min_day_in_month_index = pd.to_datetime(df.set_index(new_index, append=True).reset_index(level=0).groupby(level=0)['level_0'].min())
custom_month_starts =CustomBusinessMonthBegin(calendar = min_day_in_month_index)
Pass custom_start_months to the fist parameter of resample
Thank you J Bradley, your solution worked perfectly. I did have to upgrade my version of pandas from their official website though as the version installed via pip did not have CustomBusinessMonthBegin in pandas.tseries.offsets. My final code was:
#----- imports -----
import pandas as pd
from pandas.tseries.offsets import CustomBusinessMonthBegin
import pandas.io.data as web
#----- get sample data -----
df = web.get_data_yahoo('SPY', '2012-12-01', '2013-12-31')
#----- build custom calendar -----
month_index =df.index.to_period('M')
min_day_in_month_index = pd.to_datetime(df.set_index(month_index, append=True).reset_index(level=0).groupby(level=0)['Open'].min())
custom_month_starts = CustomBusinessMonthBegin(calendar = min_day_in_month_index)
#----- convert daily data to monthly data -----
ohlc_dict = {'Open':'first','High':'max','Low':'min','Close': 'last','Volume': 'sum','Adj Close': 'last'}
mthly_ohlcva = df.resample(custom_month_starts, how=ohlc_dict)
This yielded the following:
>>> mthly_ohlcva
Volume Adj Close High Low Close Open
Date
2012-12-03 2889875900 136.92 145.58 139.54 142.41 142.80
2013-01-01 2587140200 143.92 150.94 144.73 149.70 145.11
2013-02-01 2581459300 145.76 153.28 148.73 151.61 150.65
2013-03-01 2330972300 151.30 156.85 150.41 156.67 151.09
2013-04-01 2907035000 154.20 159.72 153.55 159.68 156.59
2013-05-01 2781596000 157.84 169.07 158.10 163.45 159.33
2013-06-03 3533321800 155.74 165.99 155.73 160.42 163.83
2013-07-01 2330904500 163.78 169.86 160.22 168.71 161.26
2013-08-01 2283131700 158.87 170.97 163.05 163.65 169.99
2013-09-02 2226749600 163.90 173.60 163.70 168.01 165.23
2013-10-01 2901739000 171.49 177.51 164.53 175.79 168.14
2013-11-01 1930952900 176.57 181.75 174.76 181.00 176.02
2013-12-02 2232775900 181.15 184.69 177.32 184.69 181.09
I've seen in the last version of pandas you can use time offset alias 'BMS', which stands for "business month start frequency" or 'BM', which stands for "business month end frequency".
The code in the first case would look like
data_dy.resample('BMS', closed='right', label='right').apply(ohlc_dict)
or, in the second case,
data_dy.resample('BM', closed='right', label='right').apply(ohlc_dict)

Categories