How to extract a certain value from a data frame? - Python

I am scraping data from Yahoo Finance and trying to pull a certain value from its data frame, using a value that comes from another file.
I have managed to scrape the data and show it as a data frame; the thing is, I need to extract a certain value from it using another df.
This is the earnings file I read in:
df_earnings = pd.read_excel(r"C:Earnings to Update.xlsx", index_col=2)
stock_symbols = df_earnings.index
output:
Date E Time Company Name
Stock Symbol
CALM 2019-04-01 Before The Open Cal-Maine Foods
CTRA 2019-04-01 Before The Open Contura Energy
NVGS 2019-04-01 Before The Open Navigator Holdings
ANGO 2019-04-02 Before The Open AngioDynamics
LW 2019-04-02 Before The Open Lamb Weston
Then I download the CSV for each stock with the data from Yahoo Finance:
driver.get(f'https://finance.yahoo.com/quote/{stock_symbol}/history?period1=0&period2=2597263000&interval=1d&filter=history&frequency=1d')
output:
Open High Low ... Adj Close Volume Stock Name
Date ...
1996-12-12 1.81250 1.8125 1.68750 ... 0.743409 1984400 CALM
1996-12-13 1.71875 1.8125 1.65625 ... 0.777510 996800 CALM
1996-12-16 1.81250 1.8125 1.71875 ... 0.750229 122000 CALM
1996-12-17 1.75000 1.8125 1.75000 ... 0.774094 239200 CALM
1996-12-18 1.81250 1.8125 1.75000 ... 0.791151 216400 CALM
My problem is here: I don't know how to take the date from my earnings data frame and use it to extract the matching row from the downloaded file.
I don't want to insert a manual date like this:
df = pd.read_csv(file_path, index_col=0, parse_dates=True)
df['Stock Name'] = stock_symbol
print(df.head())
df = df.reset_index()
print(df.loc[df['Date'] == '2019-04-01'])
output:
Date Open High ... Adj Close Volume Stock Name
5610 2019-04-01 46.700001 47.0 ... 42.987827 846900 CALM
I want a condition that will run over my data frame for each stock and pull the row for the date it needs, something like:
print(df.loc[df['Date'] == the date that sits next to the symbol I just downloaded the file for])

I suppose you could make use of a variable to hold the date.
for sy in stock_symbols:
    # The value from the 'Date' column in df_earnings
    dt = df_earnings.loc[sy, 'Date']
    # From the second block of your code relating to the 'manual' date
    df = pd.read_csv(file_path, index_col=0, parse_dates=True)
    df['Stock Name'] = sy
    df = df.reset_index()
    print(df.loc[df['Date'] == dt])
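A minimal sketch of the same lookup without hard-coding any date, assuming df_earnings is indexed by stock symbol with a 'Date' column and file_path points at the CSV just downloaded for stock_symbol:
import pandas as pd

# sketch under assumptions: df_earnings is indexed by stock symbol and has a 'Date' column,
# and file_path is the CSV downloaded for stock_symbol
earnings_date = pd.to_datetime(df_earnings.loc[stock_symbol, 'Date'])

history = pd.read_csv(file_path, parse_dates=['Date'])
history['Stock Name'] = stock_symbol

# keep only the row whose trading date matches the earnings date for this symbol
print(history.loc[history['Date'] == earnings_date])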

Related

Name dataSet as value from row pandas

I am trying to download stock data from yfinance. I have a ticker list called ticker_list_1 with a bunch of symbols. The structure is:
array(['A', 'AAP', 'AAPL', 'ABB', 'ABBV', 'ABC', 'ABEV', 'ABMD', 'ABNB',....
Now I am using the following code to download the data:
xrz_data = {}
for i, ticker in enumerate(ticker_list_1):
    data = yf.download(ticker, start="2019-01-01", end="2022-10-20")
    xrz_data[i] = data
The problem I am having is that, once downloaded, the data is saved in xrz_data keyed by number, so I can only access a specific dataset through xrz_data[i] with the corresponding number. However, I want to be able to access the data through xrz_data[tickername].
How can I achieve this?
I have tried using xrz_data[i].values = data, which gives me the error KeyError: 0.
EDIT:
Here is the current output of the for loop:
{0: Open High Low Close Adj Close \
Date
2018-12-31 66.339996 67.480003 66.339996 67.459999 65.660202
2019-01-02 66.500000 66.570000 65.300003 65.690002 63.937435
2019-01-03 65.529999 65.779999 62.000000 63.270000 61.582008
2019-01-04 64.089996 65.949997 64.089996 65.459999 63.713589
2019-01-07 65.639999 67.430000 65.610001 66.849998 65.066483
... ... ... ... ... ...
2022-10-13 123.000000 128.830002 122.349998 127.900002 127.900002
2022-10-14 129.000000 130.220001 125.470001 125.699997 125.699997
2022-10-17 127.379997 131.089996 127.379997 130.559998 130.559998
2022-10-18 133.919998 134.679993 131.199997 132.300003 132.300003
2022-10-19 130.110001 130.270004 127.239998 128.960007 128.960007
[959 rows x 6 columns],
1: Open High Low Close Adj Close \
Date
2018-12-31 156.050003 157.679993 154.990005 157.460007 149.844055
2019-01-02 156.160004 159.919998 153.820007 157.919998 150.281815
........
My desired output would be:
AAP: Open High Low Close Adj Close \
Date
2018-12-31 156.050003 157.679993 154.990005 157.460007 149.844055
2019-01-02 156.160004 159.919998 153.820007 157.919998 150.281815
........
I was able to figure it out. With the following code I am able to download the data and access specific tickers through e.g. xrz_data['AAPL']:
xrz_data = {}
for ticker in ticker_list_1:
    data = yf.download(ticker, start="2019-01-01", end="2022-10-20")
    xrz_data[ticker] = data
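A small usage sketch, assuming the loop above completed: the dictionary can now be addressed by ticker name and iterated like any other dict.
# look up one ticker's history by name
print(xrz_data['AAPL'].head())

# or loop over every downloaded ticker
for ticker, frame in xrz_data.items():
    print(ticker, len(frame), 'rows')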

Iterate and save each stock's historical data in a dataframe without downloading to CSV

I would like to pull historical data from yfinance for a specific list of stocks and store each stock in a separate dataframe (each stock with its own df).
I can download the data to multiple CSVs with the code below, but I couldn't find a way to store the stocks in different dataframes (without having to write them to CSV first).
import yfinance

stocks = ['TSLA', 'MSFT', 'NIO', 'AAPL', 'AMD', 'ADBE', 'ALGN', 'AMZN', 'AMGN', 'AEP', 'ADI', 'ANSS', 'AMAT', 'ASML', 'TEAM', 'ADSK']
for i in stocks:
    df = yfinance.download(i, start='2015-01-01', end='2021-09-12')
    df.to_csv(i + '.csv')
I want my end result to be a dataframe called "TSLA" for TSLA historical data, another one called "MSFT" for MSFT data, and so on.
I tried:
stock = ['TSLA', 'MSFT', 'NIO', 'AAPL', 'AMD']
df_ = {}
for i in stock:
    df = yfinance.download(i, start='2015-01-01', end='2021-09-12')
    df_["{}".format(i)] = df
But then I have to call each dataframe by key, like df_["TSLA"], and this is not what I want. I need a dataframe called just TSLA that holds the TSLA data, and so on. Is there a way to do it?
You don't need to download the data multiple times. You just have to split the combined download with groupby and create the variables dynamically with locals():
stocks = ['TSLA', 'MSFT', 'NIO', 'AAPL', 'AMD', 'ADBE', 'ALGN', 'AMZN',
          'AMGN', 'AEP', 'ADI', 'ANSS', 'AMAT', 'ASML', 'TEAM', 'ADSK']

data = yfinance.download(stocks, start='2015-01-01', end='2021-09-12')

for stock, df in data.groupby(level=1, axis=1):
    locals()[stock] = df.droplevel(level=1, axis=1)
    df.to_csv(f'{stock}.csv')
Output:
>>> TSLA
Adj Close Close High Low Open Volume
Date
2014-12-31 44.481998 44.481998 45.136002 44.450001 44.618000 11487500
2015-01-02 43.862000 43.862000 44.650002 42.652000 44.574001 23822000
2015-01-05 42.018002 42.018002 43.299999 41.431999 42.910000 26842500
2015-01-06 42.256001 42.256001 42.840000 40.841999 42.012001 31309500
2015-01-07 42.189999 42.189999 42.956001 41.956001 42.669998 14842000
... ... ... ... ... ... ...
2021-09-03 733.570007 733.570007 734.000000 724.200012 732.250000 15246100
2021-09-07 752.919983 752.919983 760.200012 739.260010 740.000000 20039800
2021-09-08 753.869995 753.869995 764.450012 740.770020 761.580017 18793000
2021-09-09 754.859985 754.859985 762.099976 751.630005 753.409973 14077700
2021-09-10 736.270020 736.270020 762.609985 734.520020 759.599976 15114300
[1686 rows x 6 columns]
>>> ANSS
Adj Close Close High Low Open Volume
Date
2014-12-31 82.000000 82.000000 83.480003 81.910004 83.080002 304600
2015-01-02 81.639999 81.639999 82.629997 81.019997 82.089996 282600
2015-01-05 80.860001 80.860001 82.070000 80.779999 81.290001 321500
2015-01-06 79.260002 79.260002 81.139999 78.760002 81.000000 344300
2015-01-07 79.709999 79.709999 80.900002 78.959999 79.919998 233300
... ... ... ... ... ... ...
2021-09-03 368.380005 368.380005 371.570007 366.079987 366.079987 293000
2021-09-07 372.070007 372.070007 372.410004 364.950012 369.609985 249500
2021-09-08 372.529999 372.529999 375.820007 369.880005 371.079987 325800
2021-09-09 371.970001 371.970001 375.799988 371.320007 372.519989 194900
2021-09-10 373.609985 373.609985 377.260010 372.470001 374.540009 278800
[1686 rows x 6 columns]
You can create a global or local variable like
globals()["TSLA"] = "some value"
print(TSLA)

locals()["TSLA"] = "some value"
print(TSLA)
but frankly it is a waste of time. It is much more useful to keep the dataframes in a dictionary.
With a dictionary you can use a for-loop to run some code on all the dataframes, and you can also select dataframes by name, etc.
Examples:
df_max = {}
for name, df in df_.items():
    df_max[name] = df.max()

name = input("What to display: ")
df_[name].plot()
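Building on the dictionary suggestion, here is a minimal sketch that downloads once and keeps every stock addressable by name; it assumes yfinance's group_by='ticker' layout, where the outer column level is the ticker symbol:
import yfinance

stocks = ['TSLA', 'MSFT', 'NIO', 'AAPL', 'AMD']
data = yfinance.download(stocks, start='2015-01-01', end='2021-09-12', group_by='ticker')

# one dataframe per ticker, addressable by name, e.g. frames['TSLA']
frames = {ticker: data[ticker].dropna(how='all') for ticker in stocks}
print(frames['TSLA'].head())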

Adding a value to a new column based on whether two other columns match

I have a dataframe (df1) that looks like this:
title                                              score  id      timestamp            Stock_name
Biocryst ($BCRX) continues to remain undervalued   120    mfuz84  2021-01-28 21:32:10
...and then continues with 44000-something more rows. I have another dataframe (df2) that looks like this:
Company name                    Symbol
BioCryst Pharmaceuticals, Inc.  BCRX
GameStop                        GME
Apple Inc.                      AAPL
...containing all NASDAQ- and NYSE-listed stocks. What I want to do now, however, is add the stock's symbol to the "Stock_name" column in df1. To do this, I want to match df1['title'] against df2['Symbol'] and, based on which symbol has a match in the title, add the corresponding stock name (df2['Company name']) to the df1['Stock_name'] column. If there is more than one stock name in the title, I want to use the first one mentioned.
Is there any easy way to do this?
I tried this with a little dataset and it's working; let me know if you have any problems:
df1 = pd.DataFrame({"title" : ["Biocryst ($BCRX) continues to remain undervalued", "AAPL is good, buy it"], 'score' : [120,420] , 'Stock_name' : ["",""] })
df2 = pd.DataFrame({'Company name' : ['BioCryst Pharmaceuticals, Inc.','GameStop','Apple Inc.'], 'Symbol' : ["BCRX","GME","AAPL"]})
df1
title score Stock_name
0 Biocryst ($BCRX) continues to remain undervalued 120
1 AAPL is good, buy it 420
df2
Company name Symbol
0 BioCryst Pharmaceuticals, Inc. BCRX
1 GameStop GME
2 Apple Inc. AAPL
for j in range(0, len(df2)):
    for i in range(0, len(df1)):
        if df2['Symbol'][j] in df1['title'][i]:
            df1.loc[i, 'Stock_name'] = df2['Symbol'][j]
df1
title score Stock_name
0 Biocryst ($BCRX) continues to remain undervalued 120 BCRX
1 AAPL is good, buy it 420 AAPL
First, I think you should create a dictionary based on df2.
symbol_lookup = dict(zip(df2['Symbol'],df2['Company name']))
Then you need a function that will parse the title column. If you can rely on stock symbols being preceded by a dollar sign, you can use the following:
def find_name(input_string):
    for symbol in input_string.split('$'):
        # if the first four characters form
        # a stock symbol, return the name
        if symbol_lookup.get(symbol[:4]):
            return symbol_lookup.get(symbol[:4])
        # otherwise check the first three characters
        if symbol_lookup.get(symbol[:3]):
            return symbol_lookup.get(symbol[:3])
You could also write a function based on expecting the symbols to be in parentheses. If you can't rely on either, it would be more complicated.
Finally, you can apply your function to the title column:
df1['Stock_name'] = df1['title'].apply(find_name)
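As an alternative sketch that doesn't rely on a dollar sign (an assumption-based variant, not the answer above), one could build a regex from df2['Symbol'] and let str.extract take the first symbol mentioned in each title; symbol_lookup can then translate it into the company name if that is what Stock_name should hold:
import re

# match any known symbol as a whole word; str.extract keeps the first match per title
pattern = r'\b(' + '|'.join(map(re.escape, df2['Symbol'])) + r')\b'
df1['Stock_name'] = df1['title'].str.extract(pattern, expand=False)

# optionally translate the symbol into the company name instead
# df1['Stock_name'] = df1['Stock_name'].map(symbol_lookup)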

How to get values from a dict into a new column, based on values in a column

I have a dictionary that contains all of the information for company ticker : sector. For example 'AAPL':'Technology'.
I have a CSV file that looks like this:
ticker,dimension,calendardate,datekey,reportperiod,lastupdated,accoci,assets,assetsavg,assetsc,assetsnc,assetturnover,bvps,capex,cashneq,cashnequsd,cor,consolinc,currentratio,de,debt,debtc,debtnc,debtusd,deferredrev,depamor,deposits,divyield,dps,ebit,ebitda,ebitdamargin,ebitdausd,ebitusd,ebt,eps,epsdil,epsusd,equity,equityavg,equityusd,ev,evebit,evebitda,fcf,fcfps,fxusd,gp,grossmargin,intangibles,intexp,invcap,invcapavg,inventory,investments,investmentsc,investmentsnc,liabilities,liabilitiesc,liabilitiesnc,marketcap,ncf,ncfbus,ncfcommon,ncfdebt,ncfdiv,ncff,ncfi,ncfinv,ncfo,ncfx,netinc,netinccmn,netinccmnusd,netincdis,netincnci,netmargin,opex,opinc,payables,payoutratio,pb,pe,pe1,ppnenet,prefdivis,price,ps,ps1,receivables,retearn,revenue,revenueusd,rnd,roa,roe,roic,ros,sbcomp,sgna,sharefactor,sharesbas,shareswa,shareswadil,sps,tangibles,taxassets,taxexp,taxliabilities,tbvps,workingcapital
A,ARQ,1999-12-31,2000-03-15,2000-01-31,2020-09-01,53000000,7107000000,,4982000000,2125000000,,10.219,-30000000,1368000000,1368000000,1160000000,131000000,2.41,0.584,665000000,111000000,554000000,665000000,281000000,96000000,0,0.0,0.0,202000000,298000000,0.133,298000000,202000000,202000000,0.3,0.3,0.3,4486000000,,4486000000,50960600000,,,354000000,0.806,1.0,1086000000,0.484,0,0,4337000000,,1567000000,42000000,42000000,0,2621000000,2067000000,554000000,51663600000,1368000000,-160000000,2068000000,111000000,0,1192000000,-208000000,-42000000,384000000,0,131000000,131000000,131000000,0,0,0.058,915000000,171000000,635000000,0.0,11.517,,,1408000000,0,114.3,,,1445000000,131000000,2246000000,2246000000,290000000,,,,,0,625000000,1.0,452000000,439000000,440000000,5.116,7107000000,0,71000000,113000000,16.189,2915000000
A,ARQ,2000-03-31,2000-06-12,2000-04-30,2020-09-01,-4000000,7321000000,,5057000000,2264000000,,10.27,-95000000,978000000,978000000,1261000000,166000000,2.313,0.577,98000000,98000000,0,98000000,329000000,103000000,0,0.0,0.0,256000000,359000000,0.144,359000000,256000000,256000000,0.37,0.36,0.37,4642000000,,4642000000,28969949822,,,-133000000,-0.294,1.0,1224000000,0.493,0,0,4255000000,,1622000000,0,0,0,2679000000,2186000000,493000000,29849949822,-390000000,-326000000,2000000,-13000000,0,-11000000,-341000000,95000000,-38000000,0,166000000,166000000,166000000,0,0,0.067,1010000000,214000000,572000000,0.0,6.43,,,1453000000,0,66.0,,,1826000000,297000000,2485000000,2485000000,296000000,,,,,0,714000000,1.0,452271967,452000000,457000000,5.498,7321000000,0,90000000,192000000,16.197,2871000000
A,ARQ,2000-06-30,2000-09-01,2000-07-31,2020-09-01,-6000000,7827000000,,5344000000,2483000000,,10.821,-222000000,703000000,703000000,1369000000,155000000,2.129,0.597,129000000,129000000,0,129000000,361000000,146000000,0,0.0,0.0,238000000,384000000,0.144,384000000,238000000,238000000,0.34,0.34,0.34,4902000000,,4902000000,27458542149,30,19.97,-153000000,-0.338,1.0,1301000000,0.487,0,0,4743000000,,1762000000,0,0,0,2925000000,2510000000,415000000,28032542149,-275000000,-181000000,42000000,31000000,0,73000000,-417000000,-15000000,69000000,0,155000000,155000000,155000000,0,0,0.058,1091000000,210000000,783000000,0.0,5.719,46.877,44.2,1581000000,0,61.88,2.846,2.846,2167000000,452000000,2670000000,2670000000,318000000,,,,,0,773000000,1.0,453014579,453000000,461000000,5.894,7827000000,0,83000000,238000000,17.278,2834000000
I would like to have my dictionary match up with all the tickers in the CSV file and then write the corresponding values to a column in the CSV called sector.
Code:
for ticker in company_dic:
    sf1['sector'] = sf1['ticker'].apply(company_dic[ticker])
The code is giving me problems.
For example, the first sector is Healthcare, and I get this error:
ValueError: Healthcare is an unknown string function
Would appreciate some help. I'm sure there's a pretty simple solution for this. Maybe using iterrows()?
Use .map, not .apply, to select values from a dict by using a column value as a key; .map is the method specifically implemented for this operation.
.map will return NaN if the ticker is not in the dict.
.apply can be made to work, but .map should be preferred:
df['sector'] = df.ticker.apply(lambda x: company_dict.get(x))
.get will return None if the ticker isn't in the dict.
import pandas as pd
# test dataframe for this example
df = pd.DataFrame({'ticker': ['AAPL', 'AAPL', 'AAPL'], 'dimension': ['ARQ', 'ARQ', 'ARQ'], 'calendardate': ['1999-12-31', '2000-03-31', '2000-06-30'], 'datekey': ['2000-03-15', '2000-06-12', '2000-09-01']})
# in your case, load the data from the file
df = pd.read_csv('file.csv')
# display(df)
ticker dimension calendardate datekey
0 AAPL ARQ 1999-12-31 2000-03-15
1 AAPL ARQ 2000-03-31 2000-06-12
2 AAPL ARQ 2000-06-30 2000-09-01
# dict of sectors
company_dict = {'AAPL': 'tech'}
# insert the sector column using map, into a specific column index
df.insert(loc=1, column='sector', value=df['ticker'].map(company_dict))
# display(df)
ticker sector dimension calendardate datekey
0 AAPL tech ARQ 1999-12-31 2000-03-15
1 AAPL tech ARQ 2000-03-31 2000-06-12
2 AAPL tech ARQ 2000-06-30 2000-09-01
# write the updated data back to the csv file
df.to_csv('file.csv', index=False)
# faster than a for loop
temp = sf1.ticker.map(lambda x: company_dic[str(x)])
sf1['sector'] = temp
You can pass na_action='ignore' if you have NaNs in the ticker column.
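A minimal sketch of that suggestion, assuming sf1 and company_dic are as in the question; passing the dict straight to .map drops the lambda, and na_action='ignore' skips NaN tickers:
# map ticker -> sector straight from the dict; unknown tickers become NaN
sf1['sector'] = sf1['ticker'].map(company_dic, na_action='ignore')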

How to edit a dataframe row by row while iterating?

So I am using a script to read a CSV and create a data frame, from which it then scrapes price data using the tickers in that data frame. The original data frame has the following columns; note there is NO 'Price' column.
df.columns = ['Ticker TV', 'Ticker YF', 'TV Name', 'Sector', 'Industry', 'URLTV']
I've printed the first couple of rows of my "updated" data frame below:
        Ticker TV  Ticker YF  ...  URLTV  Price
1       100D       100D.L     ...  URL    NaN
2       1GIS       1GIS.L     ...  URL    NaN
3       1MCS       1MCS.L     ...  URL    NaN
...     ...        ...        ...  ...    ...
2442    ZYT        ZYT.L      ...  URL    NaN
100D.L  NaN        NaN        ...  NaN    9272.50
1GIS.L  NaN        NaN        ...  NaN    8838.50
1MCS.L  NaN        NaN        ...  NaN    5364.00
As you can see, it's not working as intended. I would like to create a new column named Price and attach each price to the correct ticker, so 100D.L should get 9272.50; then, when the script iterates to the next ticker, it should add the next price value to 1GIS, and so forth.
tickerList = df['Ticker YF']
for tick in tickerList:
    summarySoup = getSummary(tick)
    currentPriceData = priceData(summarySoup)
    print('The Price of ' + tick + ' is ' + str(currentPriceData))
    df.at[tick, 'Price'] = currentPriceData
Assign the price using the apply method:
df['Price'] = df['Ticker YF'].apply(lambda x: str(priceData(getSummary(x))))
tick is just the value from your 'Ticker YF' column, so you can use enumerate to also get the index. And if you want to access the previous price to add them up, you can then just use idx-1:
tickerList = df['Ticker YF']
for idx, tick in enumerate(tickerList):
    summarySoup = getSummary(tick)
    currentPriceData = priceData(summarySoup)
    print('The Price of ' + tick + ' is ' + str(currentPriceData))
    if idx != 0:
        df.at[idx+1, 'Price'] = float(currentPriceData) + float(df.at[idx, 'Price'])
    else:
        df.at[idx+1, 'Price'] = float(currentPriceData)
A more "elegant" idea could be something like:
df["Single_Price"]=df["Ticker YF"].apply(lambda x: priceData(getSummary(x)))
to get the value of the single prices. And then create the next column with the added prices:
df["Price"]=df["Ticker"].apply(lambda x: df["Single_Price"][df["Ticker"]<x["Ticker"]].sum())
this will add up every Single_Price (df["Single_Price"]) from every row that is before your current row Ticker x (df["Ticker"] < x["Ticker"]) and creates a new column Price in your dataframe.
after that cou can simply delete the single prices if you don't need them with:
del df["Single_Price"]
