Data appears when printed but doesn't show up in dataframe - python

#! /usr/lib/python3
import yfinance as yf
import pandas as pd

pd.set_option('display.max_rows', None, 'display.max_columns', None)

# Request stock data from yfinance
ticker = yf.Ticker('AAPL')
# Get all option expiration dates in the form of a list
xdates = ticker.options

# Go through the list of expiry dates one by one
for xdate in xdates:
    # Get option chain info for that xdate
    option = ticker.option_chain(xdate)
    # Print out this value; get back 15 columns and 63 rows of information
    print(option)
    # Put that same data in a dataframe
    df = pd.DataFrame(data=option)
    # Show the dataframe
    print(df)
Expected: df will show a DataFrame containing the same information that is shown when running print(option), i.e. 15 columns and 63 rows of data, or at least some part of it.
Actual:
df shows only two rows in a single column, with no usable information
df.shape results in (2, 1)
print(df.columns.tolist()) results in [0]
Since the desired info appears when you print it, I'm confused as to why it's not appearing in the dataframe.

option_chain returns an object holding two DataFrames rather than tabular data itself, which is why pd.DataFrame(option) produces the 2 x 1 frame you saw. The data for a specific expiration date is already available as a DataFrame in the calls property (and likewise the puts property) of that object. You don't have to create a new dataframe.
ticker = yf.Ticker('AAPL')
xdates = ticker.options
option = ticker.option_chain(xdates[0])
option.calls # DataFrame
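If you want everything in one DataFrame anyway, here is a minimal sketch (it relies only on the calls attribute shown above; the expiration column is a hypothetical tag added for bookkeeping) that collects the calls for every expiry and concatenates them:
import pandas as pd
import yfinance as yf

ticker = yf.Ticker('AAPL')
frames = []
for xdate in ticker.options:
    calls = ticker.option_chain(xdate).calls  # already a DataFrame
    calls['expiration'] = xdate               # tag each row with its expiry
    frames.append(calls)

all_calls = pd.concat(frames, ignore_index=True)
print(all_calls.shape)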

Related

Combining Successive Pandas Dataframes in One Master Dataframe via a Loop

I'm trying to loop through a series of tickers cleaning the associated dataframes then combining the individual ticker dataframes into one large dataframe with columns named for each ticker. The following code enables me to loop through unique tickers and name the columns of each ticker's dataframe after the specific ticker:
import pandas as pd

def clean_func(tkr, f1):
    f1['Date'] = pd.to_datetime(f1['Date'])
    f1.index = f1['Date']
    keep = ['Col1', 'Col2']
    f2 = f1[keep]
    f2.columns = [tkr + 'Col1', tkr + 'Col2']
    return f2

tkrs = ['tkr1', 'tkr2', 'tkr3']

for tkr in tkrs:
    df1 = pd.read_csv(f'C:\\path\\{tkr}.csv')
    df2 = clean_func(tkr, df1)
However, I don't know how to create a master dataframe where I add each new ticker to the master dataframe. With that in mind, I'd like to align each new ticker's data using the datetime index. So, if tkr1 has data for 6/25/22, 6/26/22, 6/27/22, and tkr2 has data for 6/26/22, and 6/27/22, the combined dataframe would show all three dates but would produce a NaN for ticker 2 on 6/25/22 since there is no data for that ticker on that date.
Outside of a loop, the following code does what I'd like. But it doesn't work when looping and adding new ticker data on each successive pass (or I don't know how to make it work within the confines of a loop).
combined = pd.concat((df1, df2, df3,...,dfn), axis=1)
Many thanks in advance.
You should only create the master DataFrame after the loop. Appending to the master DataFrame on every iteration via pandas.concat is slow, since you create a brand-new DataFrame each time.
Instead, read each ticker DataFrame, clean it, and append it to a list that stores all the ticker DataFrames. After the loop, build the master DataFrame from that list with a single pandas.concat:
import pandas as pd

def clean_func(tkr, f1):
    f1['Date'] = pd.to_datetime(f1['Date'])
    f1.index = f1['Date']
    keep = ['Col1', 'Col2']
    f2 = f1[keep]
    f2.columns = [tkr + 'Col1', tkr + 'Col2']
    return f2

tkrs = ['tkr1', 'tkr2', 'tkr3']
dfs_list = []

for tkr in tkrs:
    df1 = pd.read_csv(f'C:\\path\\{tkr}.csv')
    df2 = clean_func(tkr, df1)
    dfs_list.append(df2)

master_df = pd.concat(dfs_list, axis=1)
As a suggestion, here is a cleaner way of defining your clean_func using DataFrame.set_index and DataFrame.add_prefix:
def clean_func(tkr, f1):
    f1['Date'] = pd.to_datetime(f1['Date'])
    f2 = f1.set_index('Date')[['Col1', 'Col2']].add_prefix(tkr)
    return f2
Or, if you want, you can parse the Date column as datetime and set it as the index directly in the pd.read_csv call by specifying the index_col and parse_dates parameters (the two work together just fine):
import pandas as pd

def clean_func(tkr, f1):
    f2 = f1[['Col1', 'Col2']].add_prefix(tkr)
    return f2

tkrs = ['tkr1', 'tkr2', 'tkr3']
dfs_list = []

for tkr in tkrs:
    df1 = pd.read_csv(f'C:\\path\\{tkr}.csv', index_col='Date', parse_dates=['Date'])
    df2 = clean_func(tkr, df1)
    dfs_list.append(df2)

master_df = pd.concat(dfs_list, axis=1)
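As a small demonstration of the index alignment the question asks about (toy data, not real tickers): pd.concat with axis=1 aligns rows on the DatetimeIndex and fills NaN where a ticker has no data for a date.
import pandas as pd

idx1 = pd.to_datetime(['2022-06-25', '2022-06-26', '2022-06-27'])
idx2 = pd.to_datetime(['2022-06-26', '2022-06-27'])
a = pd.DataFrame({'tkr1Col1': [1, 2, 3]}, index=idx1)
b = pd.DataFrame({'tkr2Col1': [4, 5]}, index=idx2)

print(pd.concat([a, b], axis=1))  # tkr2Col1 is NaN on 2022-06-25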
Before the loop, create an empty df with:
combined = pd.DataFrame()
Then within the loop (after loading df1 - see code above):
combined = pd.concat((combined, clean_func(tkr, df1)), axis=1)
If you get:
TypeError: concat() got multiple values for argument 'axis'
Make sure your parentheses are correct per above.
With the code above, you can skip the original step:
df2 = clean_func(tkr, df1)
since the call is embedded in the concat expression. Alternatively, you could keep the df2 step and use:
combined = pd.concat((combined, df2), axis=1)
Just make sure the dataframes are wrapped in a single tuple (or list) inside the concat call.
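To illustrate why the parentheses matter (df1 and df2 here are any two dataframes): without them, df2 is consumed as the positional axis argument.
pd.concat(df1, df2, axis=1)    # TypeError: concat() got multiple values for argument 'axis'
pd.concat((df1, df2), axis=1)  # correct: the frames are passed as a single tuple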
Same answer as GC123, but here is a full example which mimics reading from separate files and concatenating them:
import pandas as pd
import io

fake_file_1 = io.StringIO("""
fruit,store,quantity,unit_price
apple,fancy-grocers,2,9.25
pear,fancy-grocers,3,100
banana,fancy-grocers,1,256
""")
fake_file_2 = io.StringIO("""
fruit,store,quantity,unit_price
banana,bargain-grocers,667,0.01
apple,bargain-grocers,170,0.15
pear,bargain-grocers,281,0.45
""")
fake_files = [fake_file_1, fake_file_2]

combined = pd.DataFrame()
for fake_file in fake_files:
    df = pd.read_csv(fake_file)
    df = df.set_index('fruit')
    combined = pd.concat((combined, df), axis=1)
print(combined)
The output is the same as shown below. This method, which collects the frames in a list first and concatenates once, is slightly more efficient:
combined = []
for fake_file in fake_files:
    combined.append(pd.read_csv(fake_file).set_index('fruit'))
combined = pd.concat(combined, axis=1)
print(combined)
Output:
                 store  quantity  unit_price            store  quantity  unit_price
fruit
apple    fancy-grocers         2        9.25  bargain-grocers       170        0.15
pear     fancy-grocers         3      100.00  bargain-grocers       281        0.45
banana   fancy-grocers         1      256.00  bargain-grocers       667        0.01

Solve SettingWithCopyWarning in pandas

Here is the issue I encountered:
/var/folders/v0/57ps6v293zx6jb2g78v0__1h0000gn/T/ipykernel_58173/392784622.py:4: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Here's my code
AMZN['VWDR'] = AMZN['Volume'] * AMZN['DailyReturn']/ AMZN['Volume'].cumsum()
I also tried the following, but it did not resolve the warning:
AMZN.loc[AMZN.index,'VWDR'] = AMZN.loc[AMZN.index, 'Volume'] * AMZN.loc[AMZN.index, 'DailyReturn']/ AMZN.loc[AMZN.index,'Volume'].cumsum()
Below is the code to get my table:
import pandas as pd
import yfinance as yf

# read the html to get all the S&P 500 tickers
dataload = pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
df = dataload[0]

# get the first column (tickers) from the above data
# and convert it into a list
ticker_list = df['Symbol'].values.tolist()

# convert the list into a space-separated string, replacing . with -
all_tickers = " ".join(ticker_list).replace('.', '-')  # this ensures we can find BRK.B and BF.B
# get all the tickers from yfinance
tickers = yf.Tickers(all_tickers)

# set a start and end date to get two years of info, grouped by ticker
hist = tickers.history(start='2020-05-01', end='2022-05-01', group_by='ticker')

# 'stack' the table to get it into row form
Data_stack = pd.DataFrame(hist.stack(level=0).reset_index().rename(columns={'level_1': 'Ticker'}))

# add a column to the table containing the daily return per ticker
Data_stack['DailyReturn'] = Data_stack.sort_values(['Ticker', 'Date']).groupby('Ticker')['Close'].pct_change()
Data_stack = Data_stack.set_index('Date')  # now set the Date as the index

# get the AMZN data by filtering the table on Ticker
AMZN = Data_stack[Data_stack.Ticker == 'AMZN']
For simplicity, you might just download the AMZN ticker table from yfinance.
Copy AMZN when you create it:
AMZN = Data_stack[Data_stack.Ticker=='AMZN'].copy()
# ^^^^^^^
Then the rest of your code won't have a warning.
What you have here is chained assignment; to resolve it, make a copy and drop .loc in favor of plain column selection:
AMZN = Data_stack[Data_stack.Ticker=='AMZN'].copy()
AMZN['VWDR'] = AMZN['Volume'] * AMZN['DailyReturn']/ AMZN['Volume'].cumsum()
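For reference, a minimal sketch of the difference on a hypothetical toy frame (this reflects pandas' legacy copy semantics, before copy-on-write):
import pandas as pd

df = pd.DataFrame({'Ticker': ['AMZN', 'AAPL'], 'Volume': [10, 20]})

sliced = df[df.Ticker == 'AMZN']         # may be a view of df
sliced['VWDR'] = 1.0                     # -> SettingWithCopyWarning

copied = df[df.Ticker == 'AMZN'].copy()  # independent copy
copied['VWDR'] = 1.0                     # no warning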

Change dateformat

I have this code where I wish to change the date format, but I only manage to change one line and not the whole dataset.
Code:
import pandas as pd

df = pd.read_csv("data_q_3.csv")
result = df.groupby("Country/Region").max().sort_values(by='Confirmed', ascending=False)[:10]
pd.set_option('display.max_columns', None)
print("Covid 19 top 10 countries based on confirmed case:")
print(result)

from datetime import datetime
datetime.fromisoformat("2020-03-18T12:13:09").strftime("%Y-%m-%d-%H:%M")
Does anyone know how to fix the code so that the datetime changes in the whole dataset?
Thanks!
After looking at your problem for a while, I figured out how to change the values in the 'DateTime' column. The only problem that may arise is if the 'Country/Region' column has duplicate location names.
Editing the time is simple, as all you have to do is make use of Python's slicing. You can slice a string by typing
string = 'abcdefghijklmnopqrstuvwxyz'
print(string[0:5])
which will result in abcde (the characters at indices 0 through 4).
Below is the finished code.
import pandas as pd

# read the data
df = pd.read_csv("data_q_3.csv")
# top 10 countries by confirmed cases
result = df.groupby("Country/Region").max().sort_values(by='Confirmed', ascending=False)[:10]
pd.set_option('display.max_columns', None)

# you need a for loop to go through the whole column
for row in result.index:
    # get the currently stored time, e.g. '2020-03-18T12:13:09'
    time = result.at[row, 'DateTime']
    # reformat the time string by keeping the date part (indices 0 to 10)
    # and the hour:minute part (indices 11 to 16),
    # joined with a dash in the middle
    time = time[0:10] + "-" + time[11:16]
    # store the new time in the result
    result.at[row, 'DateTime'] = time

# print result
print("Covid 19 top 10 countries based on confirmed case:")
print(result)
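A vectorized alternative (a sketch, assuming the 'DateTime' column holds ISO-8601 strings like '2020-03-18T12:13:09'): parse the whole column once with pd.to_datetime and format it back, with no explicit loop.
result['DateTime'] = pd.to_datetime(result['DateTime']).dt.strftime("%Y-%m-%d-%H:%M")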

Iterating through values of one column to get descriptive statistics for another column

I'm trying to get descriptive statistics for a column of data (the tb column which is a list of numbers) for every individual (i.e., each ID). Normally, I'd use a for i in range(len(list)) statement but since the ID is not a number I'm unsure of how to do that. Any tips would be helpful! The code included below gets me descriptive statistics for the entire tb column, instead of for tb data for each individual in the ID list.
df = pd.DataFrame(pd.read_csv("SurgeryTpref.csv"))  # importing data
df.columns = ['date', 'time', 'tb', 'ID', 'before_after']  # column headers
df.to_numpy()
import pandas as pd
# read the data in with
df = pd.read_clipboard(sep=',')
# data
,date,time,tb,ID,before_after
0,6/29/20,4:15:33 PM,37.1,SR10,after
1,6/29/20,4:17:33 PM,38.1,SR10,after
2,6/29/20,4:19:33 PM,37.8,SR10,after
3,6/29/20,4:21:33 PM,37.5,SR10,after
4,6/29/20,4:23:33 PM,38.1,SR10,after
5,6/29/20,4:25:33 PM,38.5,SR10,after
6,6/29/20,4:27:33 PM,38.6,SR10,after
7,6/29/20,4:29:33 PM,37.6,SR10,after
8,6/29/20,4:31:33 PM,35.5,SR10,after
9,6/29/20,4:33:33 PM,34.7,SR10,after
summary = []
for individual in (ID):
    vals = df['tb'].describe()
    summary.append(vals)
print(summary)
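For what it's worth, the usual pandas idiom for per-group statistics is groupby, which sidesteps positional indexing entirely; a minimal sketch using the column names from the question:
import pandas as pd

df = pd.read_csv("SurgeryTpref.csv")
df.columns = ['date', 'time', 'tb', 'ID', 'before_after']

# one row of descriptive statistics (count, mean, std, ...) per unique ID
summary = df.groupby('ID')['tb'].describe()
print(summary)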

Trying to iterate and join Pandas DFs: AttributeError: 'Series' object has no attribute 'join'

I'm looking to pull the historical data for ~200 securities in a given index. I import the list of securities from a csv file then iterate over them to pull their respective data from the quandl api. That dataframe for each security has 12 columns, so I create a new column with the name of the security and the Adjusted Close value, so I can later identify the series.
I'm receiving an error when I try to join all the new columns into an empty dataframe. I receive an attribute error:
'''
Print output data
'''
grab_constituent_data()
AttributeError: 'Series' object has no attribute 'join'
Below is the code I have used to arrive here thus far.
'''
Import the modules necessary for analysis
'''
import quandl
import pandas as pd
import numpy as np

'''
Set file paths and API keys
'''
ticker_path = ''
auth_key = ''

'''
Pull a list of tickers in the IGM ETF
'''
def ticker_list():
    df = pd.read_csv('{}IGM Tickers.csv'.format(ticker_path))
    # print(df['Ticker'])
    return df['Ticker']

'''
Pull the historical prices for the securities within Ticker List
'''
def grab_constituent_data():
    tickers = ticker_list()
    main_df = pd.DataFrame()
    for abbv in tickers:
        query = 'EOD/{}'.format(str(abbv))
        df = quandl.get(query, authtoken=auth_key)
        print('Completed the query for {}'.format(query))
        df['{} Adj_Close'.format(str(abbv))] = df['Adj_Close'].copy()
        df = df['{} Adj_Close'.format(str(abbv))]
        print('Completed the column adjustment for {}'.format(str(abbv)))
        if main_df.empty:
            main_df = df
        else:
            main_df = main_df.join(df)
    print(main_df.head())
It seems that in the line
df = df['{} Adj_Close'.format(str(abbv))]
you're getting a Series, not a DataFrame. If you want to convert the Series to a DataFrame, you can use its to_frame() method:
df = df['{} Adj_Close'.format(str(abbv))].to_frame()
I didn't check whether your code could be simplified further, but this should fix the issue.
To change a Series into a pandas DataFrame, you can use the following:
df = pd.DataFrame(df)
After running the above code, the Series will become a DataFrame, and you can then proceed with the join tasks mentioned earlier.
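As an aside (a sketch, not part of either answer): selecting with a list of column names keeps the result a DataFrame in the first place, so no conversion is needed before .join:
df = df[['{} Adj_Close'.format(str(abbv))]]  # double brackets -> DataFrame, not Series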
