I have a CSV file which contains the ticker symbols for all the stocks listed on Nasdaq. Here is a link to that CSV file; one can download it from there. There are more than 8000 stocks listed. Following is the code:
import pandas as pd
import yfinance as yf  # pip install yfinance

tick_pd = pd.read_csv("/path/to/the/csv/file/nasdaq_screener_1654004691484.csv",
                      usecols=[0])
I have made a function which retrieves the historical stock prices for a ticker symbol. That function is as follows:
## function to be applied on each stock symbol
def appfunc(ticker):
    A = yf.Ticker(ticker).history(period="max")
    A["symbol"] = ticker
    return A
And I apply this function to each ticker symbol in tick_pd the following way:
hist_prices = tick_pd["Symbol"].apply(appfunc)
But this takes way too much time. I was hoping someone could suggest a way to retrieve this data much more quickly, or a way to parallelize it. I am quite new to Python, so I don't know many ways to do this.
Thanks in advance.
You can use yf.download to download all tickers at once; it fetches them in parallel:
tick_pd = pd.read_csv('nasdaq_screener_1654024849057.csv', usecols=[0])
df = yf.download(tick_pd['Symbol'].tolist(), period='max')
You can use the threads parameter of yf.download:
# Enable mass downloading (default is True)
df = yf.download(tick_pd['Symbol'].tolist(), period='max', threads=True)
# OR
# You can control the number of threads
df = yf.download(tick_pd['Symbol'].tolist(), period='max', threads=8)
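If you want the result in the same long shape your appfunc produced (one row per date per ticker, plus a symbol column), you can reshape the wide frame yf.download returns. A minimal sketch, assuming the default column layout of price field on the first level and ticker on the second:
long_df = (
    df.stack(level=1)                   # move the ticker level from the columns into the index
      .rename_axis(['Date', 'symbol'])  # name the two index levels
      .reset_index()                    # turn them back into ordinary columns
)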
First off, thank you for taking the time to help me. We are using this Python script to pull data from Yahoo for a given time period.
import datetime as dt
import matplotlib.pyplot as plt
from matplotlib import style
import pandas as pd
import pandas_datareader.data as web
style.use('ggplot')
start = dt.datetime(2007, 1, 1)
end = dt.datetime(2022, 1, 31)
df = web.DataReader('AAPL', 'yahoo', start, end)
df.to_csv('AAPL.csv')
The code above grabs the data we need from Yahoo for the AAPL stock for the dates we set, then it creates a CSV for that stock. The problem we are running into is that we have 5000 different stocks we need to do this for. We have a CSV file with all the different tickers we need to run this program over. How can we modify our code to run over the different stocks from our CSV, instead of having to run this program manually 5000 times?
You don't have to write the intermediate dataframes (for individual stock symbols) to file. Try something like:
tickers = pd.read_csv(r'path_to/symbols.csv')['symbol_column_name'].values

full_df = pd.DataFrame({})
for ticker in tickers:
    df = web.DataReader(ticker, 'yahoo', start, end)
    full_df = pd.concat((full_df, df))

# Now write merged table to file
full_df.to_csv('my_output_table.csv')
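With 5000 symbols, some of the requests will almost certainly fail, so a variation of the loop above that skips those tickers instead of aborting the whole run may help; the symbol column and the error handling here are additions, not part of the answer above:
frames = []
for ticker in tickers:
    try:
        df = web.DataReader(ticker, 'yahoo', start, end)
        df['symbol'] = ticker              # remember which rows belong to which stock
        frames.append(df)
    except Exception as exc:               # e.g. delisted or renamed tickers
        print(f'Skipping {ticker}: {exc}')

full_df = pd.concat(frames)
full_df.to_csv('my_output_table.csv')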
I'm a novice when it comes to Python, and in order to learn it I was working on a side project. My goal is to track the prices of my YGO cards using the Yu-Gi-Oh Prices API https://yugiohprices.docs.apiary.io/#
I am attempting to manually enter the print tag for each card and then have the API pull the data and populate the spreadsheet with the name of the card and its trait, in addition to the price data, so that anytime I run the code, it is updated.
My idea was to use a for loop to have the API look up each print tag, store the information in an empty dictionary, and then write the results to the Excel file. I added an example of the spreadsheet.
Please let me know if I can clarify further. Any suggestions to the code that would help me achieve the goal of this project would be appreciated. Thanks in advance.
import requests
import json
import pandas as pd
df = pd.read_excel("api_ygo.xlsx")
print(df[:5])  # see the first 5 rows
response = requests.get('http://yugiohprices.com/api/price_for_print_tag/print_tag')
print(response.json())
data = []
for i in df:
    print_tag = i[2]
    request = requests.get('http://yugiohprices.com/api/price_for_print_tag/print_tag' + print_tag)
    data.append(print_tag)
print(data)

def jprint(obj):
    text = json.dumps(obj, sort_keys=True, indent=4)
    print(text)

jprint(response.json())
Example Spreadsheet
Iterating over a pandas dataframe can be done using df.apply(). This has the added advantage that you can store the results directly in your dataframe.
First define a function that returns the desired result. Then apply the relevant column to that function while assigning the output to a new column:
import requests
import pandas as pd
import time
df = pd.DataFrame(['EP1-EN002', 'LED6-EN007', 'DRL2-EN041'], columns=['print_tag']) #just dummy data, in your case this is pd.read_excel
def get_tag(print_tag):
    request = requests.get('http://yugiohprices.com/api/price_for_print_tag/' + print_tag)  # this url works, the one in your code wasn't correct
    time.sleep(1)  # sleep for a second to prevent sending too many API calls per minute
    return request.json()

df['result'] = df['print_tag'].apply(get_tag)
You can now export this column to a list of dictionaries with df['result'].tolist(). Or even better, you can flatten the results into a new dataframe with pd.json_normalize:
df2 = pd.json_normalize(df['result'])
df2.to_excel('output.xlsx') # save dataframe as new excel file
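If you also want the print tag you searched for next to the flattened price columns, you can insert it before writing the file; a small sketch continuing from df2 above (index=False simply leaves pandas' row numbers out of the sheet):
df2.insert(0, 'print_tag', df['print_tag'])  # keep the tag you searched for as the first column
df2.to_excel('output.xlsx', index=False)     # save dataframe as new excel file, without the index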
So I developed a script that would pull data from a live-updated site tracking coronavirus data. I set it up to pull data every 30 minutes but recently tested it on updates every 30 seconds.
The idea is that it sends the request to the site, pulls the HTML, creates a list of all of the data I need, then restructures it into a dataframe (basically the country, the cases, deaths, etc.).
Then it takes each row and appends it to the rows of each of the 123 Excel files for the various countries. This works well for, I believe, somewhere in the range of 30-50 iterations before it either corrupts files or produces weird data entries.
My code is below. I know it's poorly written (my initial reasoning was that I felt confident I could set it up quickly and I wanted to start collecting data quickly; unfortunately I overestimated my abilities, but now I want to learn what went wrong). Below my code I'll include sample output.
PLEASE note that this 30 second interval is only for quick testing; I don't normally intend to send that many requests over months. I just wanted to see what the issue was. It was originally set to pull every 30 minutes when I detected this issue.
See below for the code:
import schedule
import time
def RecurringProcess2():
    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    import datetime
    import numpy as np
    from os import listdir
    import os

    try:
        extractTime = datetime.datetime.now()
        extractTime = str(extractTime)
        print("Access Initiated at " + extractTime)
        link = 'https://www.worldometers.info/coronavirus/'
        response = requests.get(link)
        soup = BeautifulSoup(response.text, 'html.parser').findAll('td')  # [1107].get_text()
        table = pd.DataFrame(columns=['Date and Time','Country','Total Cases','New Cases','Total Deaths','New Deaths','Total Recovered','Active Cases','Serious Critical','Total Cases/1M pop'])
        soupList = []
        for i in range(1107):
            value = soup[i].get_text()
            soupList.insert(i, value)
        table = np.reshape(soupList, (123, -1))
        table = pd.DataFrame(table)
        table.columns = ['Country','Total Cases','New Cases (+)','Total Deaths','New Deaths (+)','Total Recovered','Active Cases','Serious Critical','Total Cases/1M pop']
        table['Date & Time'] = extractTime
        # Below code is run once to generate the initial files. That's it.
        # for i in range(122):
        #     fileName = table.iloc[i, 0] + '.xlsx'
        #     table.iloc[i:i+1, :].to_excel(fileName)
        FilesDirectory = 'D:\\Professional\\Coronavirus'
        fileType = '.csv'
        filenames = listdir(FilesDirectory)
        DataFiles = [filename for filename in filenames if filename.endswith(fileType)]
        for file in DataFiles:
            countryData = pd.read_csv(file, index_col=0)
            MatchedCountry = table.loc[table['Country'] == str(file)[:-4]]
            if file == ' USA .csv':
                print("Country Data Rows: ", len(countryData))
                if os.stat(file).st_size < 1500:
                    print("File Size under 1500")
            countryData = countryData.append(MatchedCountry)
            countryData.to_csv(FilesDirectory + '\\' + file, index=False)
    except:
        pass
    print("Process Complete!")
    return

schedule.every(30).seconds.do(RecurringProcess2)

while True:
    schedule.run_pending()
    time.sleep(1)
When I check the files after some number of iterations (usually successful for around 30-50), a file has either kept only 2 rows and lost all the others, or it keeps appending rows while deleting a single entry in the row above, two entries in the row above that, and so on (essentially forming a triangle of sorts).
Above that image there would be a few hundred empty rows. Does anyone have an idea of what is going wrong here? I'd consider this a failed attempt but would still like to learn from it. I appreciate any help in advance.
Hi, as per my understanding the webpage only has one table element. My suggestion would be to use pandas' read_html method, as it provides a clean and structured table.
Try the code below; you can modify it to run on your schedule:
import requests
import pandas as pd
url = 'https://www.worldometers.info/coronavirus/'
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[-1]
print(df)
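If you still want one CSV per country, you can append rows from this clean dataframe instead of rebuilding the table from raw <td> tags. A rough sketch continuing from df above (which column holds the country name is an assumption here, so check df.columns first; the folder path is the one from the question):
import os
import datetime as dt

df['Date & Time'] = str(dt.datetime.now())
country_col = df.columns[1]  # assumption: the second column holds the country name

out_dir = 'D:\\Professional\\Coronavirus'
for _, row in df.iterrows():
    out_file = os.path.join(out_dir, str(row[country_col]).strip() + '.csv')
    # append one row per run; write the header only when the file is first created
    row.to_frame().T.to_csv(out_file, mode='a', header=not os.path.exists(out_file), index=False)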
Disclaimer: I'm still evaluating this solution. So far it works almost perfectly for 77 rows.
Originally I had set the script up to run for .xlsx files. I converted everything to .csv but retained the index column code:
countryData = pd.read_csv(file,index_col=0)
I started realizing that things were being ordered differently every time the script ran. I have since removed that from the code and so far it works. Almost.
   Unnamed: 0   Unnamed: 0.1
0            7
             7
For some reason I get the above output in every file. I don't know why. It's in the first two columns, yet the script still seems to be reading and writing correctly. Not sure what's going on here.
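A common cause of Unnamed: 0 / Unnamed: 0.1 columns is that a row index got written into the file at some point (to_excel and to_csv do that by default) and is then read back as a data column on the next run, stacking up an extra column each time. A small sketch of keeping the read and write sides consistent so that neither stores an index (pd.concat stands in for the deprecated DataFrame.append):
countryData = pd.read_csv(file)                                # no index_col: the files no longer store one
countryData = pd.concat([countryData, MatchedCountry])         # same effect as the old .append call
countryData.to_csv(FilesDirectory + '\\' + file, index=False)  # and don't write a new index back out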
I'm trying to stream live quotes using the IexFinance API; keep in mind this is my first coding attempt. I've managed to get the stock quote prices through Python, but I'm unsure how to then get that data into Excel.
From my understanding I would need to get this data into a CSV file in order to bring it into Excel. I've tried adding the code df.to_csv('stock.csv') but I get the error 'StockReader' object has no attribute 'to_csv'.
import pandas as pd
from iexfinance.stocks import Stock

batch = Stock(['amd', 'tsla'], output_format='pandas')
batch.get_price
df.to_csv('stock.csv')
General pointer: it looks like you need to assign the result to the df variable first.
This line:
Stock(['amd', 'tsla'], output_format='pandas')
According to the guidance, it returns a dataframe.
So:
df = Stock(['amd', 'tsla'], output_format='pandas')
Or, as you have now discovered:
df = batch.get_price()
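Putting the two pieces together, a minimal sketch of the whole flow; note that get_price is called as a method here, and the exact shape of the returned frame depends on your iexfinance version:
from iexfinance.stocks import Stock

batch = Stock(['amd', 'tsla'], output_format='pandas')
df = batch.get_price()   # returns a DataFrame when output_format='pandas'
df.to_csv('stock.csv')   # Excel can open this CSV directly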
I'm doing some research on Cambridge Analytica and wanted to gather as many news articles as I can from some news outlets.
I was able to scrape them and now have a bunch of JSON files in a folder.
Some of them have only this [] written in them while others have the data I need.
Using pandas, I used the following and got every webTitle in the file:
df = pd.read_json(json_file)
df['webTitle']
The thing is that whenever there's an empty file it won't even let me assign df['webTitle'] to a variable.
Is there a way for me to check if it is empty and if it is just go to the next file?
I want to make this into a spreadsheet, with a few of the keys as columns and the values as rows for each news article.
My files are organized by day and I've used TheGuardian API to get the data.
I did not write much yet but just in case here's the code as it is:
import pandas as pd
import os
def makePathToFile(path):
    pathToJson = []
    for root, sub, filename in os.walk(path):
        for i in filename:
            pathToJson.append(os.path.join(path, i))
    return pathToJson

def readJsonAndWriteCSV(pathToJson):
    for json_file in pathToJson:
        df = pd.read_json(json_file)
Thanks!
You can set up a Google Alert for the news keywords you want, then scrape the results in Python using https://pypi.org/project/galerts/
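On the original question of skipping the empty files: a minimal sketch, assuming each non-empty file holds a list of article objects with a webTitle key (path_to_articles and webtitles.csv are placeholder names):
import os
import pandas as pd

titles = []
for root, sub, filenames in os.walk(path_to_articles):
    for name in filenames:
        df = pd.read_json(os.path.join(root, name))
        if df.empty:   # files that contain only [] load as an empty dataframe
            continue   # skip straight to the next file
        titles.extend(df['webTitle'].tolist())

pd.DataFrame({'webTitle': titles}).to_csv('webtitles.csv', index=False)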