Python Financial Chart Scraping
Right now I'm trying to scrape the dividend yield from a chart using the following code.
import pandas as pd

df = pd.read_html('https://www.macrotrends.net/stocks/charts/BMO/Bank-of-Montreal/dividend-yield-history')
df = df[0].dropna()
But the code won't pick up the chart's data.
Any suggestions on pulling it from the website?
Here is the specific link I'm trying to use: https://www.macrotrends.net/stocks/charts/BMO/Bank-of-Montreal/dividend-yield-history
I've used the same code to pick up book values, but the page objects used for the dividend and book-value charts must be different.
Maybe I could use Beautiful Soup?
Sadly that website is rendered dynamically, so there's nothing in the HTML pandas receives for it to scrape: the chart's data is fetched by JavaScript after the page loads. Parsing the initial HTML yourself won't help either, because the data simply isn't there.
You have three options, roughly in order of preference: find an API that provides the data (best, and quite plausible given the content); work out where the page fetches its data from and see if you can request that endpoint directly (better, if possible); or use something like Selenium to drive a real browser, let it render the page, grab the resulting HTML, and parse that (a sketch of this last route follows below).
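For illustration, here is a minimal sketch of the Selenium route, assuming Chrome with a matching chromedriver available; the fixed sleep and the table index are guesses to tune against the rendered page.

import time
import pandas as pd
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.macrotrends.net/stocks/charts/BMO/Bank-of-Montreal/dividend-yield-history')
time.sleep(5)  # crude wait for the JavaScript to render; WebDriverWait would be more robust

# Hand the fully rendered HTML to pandas instead of the bare URL.
tables = pd.read_html(driver.page_source)
driver.quit()

df = tables[0].dropna()
print(df.head())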
Related
Python - How to scrape a table from a website with a dropdown of available rows
I am trying to scrape the earnings calendar data from the table on zacks.com; the URL is below.

https://www.zacks.com/stock/research/aapl/earnings-calendar

I want to scrape all of the data from the table, but it has a dropdown list to show 10, 25, 50, or 100 rows per page. Ideally I want all 100 rows, but when I select 100 from the dropdown list the URL doesn't change. My code is below. Note that the website blocks the default user agent, so I had to use chromedriver to impersonate a human visitor. pd.read_html returns a list of all the tables on the page, and d[4] is the earnings calendar with only 10 rows (which I want to change to 100).

driver = webdriver.Chrome('../files/chromedriver96')
symbol = 'AAPL'
url = 'https://www.zacks.com/stock/research/{}/earnings-calendar'.format(symbol)
driver.get(url)
content = driver.page_source
d = pd.read_html(content)
d[4]

So I'm calling for help from anyone who can guide me on this. Thanks!

UPDATE: it looks like my last post was downvoted for lack of clear articulation and evidence of past research. Maybe I am still a newbie at posting questions on this site. I have actually found several pages, including this one, with the same issue, but the solutions didn't work for me, which is why I posted this as a new question.

UPDATE 12/05: Thanks a lot for the advice. As commented below, I finally got it working. Below is the code I used:

dropdown = driver.find_element_by_css_selector('#earnings_announcements_earnings_table_length')
time.sleep(1)
hundreds = dropdown.find_element_by_xpath(".//option[. = '100']")
hundreds.click()
Having taken a look, this is not going to be easy to scrape. Since the table is produced by JavaScript, I would say you have two options.

Option one: use Selenium to render the page, allowing the JavaScript to run. You can then use the id/class of the dropdown to interact with it, and scrape the data by reading the values in the table (see the sketch below).

Option two: the more challenging one. Look through the requests the page makes and try to find the ones whose responses contain the data you then see on the page. By cross-referencing these you can work out how to request the data directly. You may find that you first need to accept a key from the original request to the page, and then send that key as part of a second request. This route lets you scrape the data without running a Selenium instance, so it is more efficient.

My personal suggestion is to go with option one, as computer resources are cheap and developer time expensive.
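As a rough sketch of option one, using current Selenium 4 syntax (the find_element_by_* helpers in the question are deprecated): the element id is taken from the question's update, but the inner 'select' selector, the sleep, and the table index are assumptions to verify against the live page.

import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select

driver = webdriver.Chrome()
driver.get('https://www.zacks.com/stock/research/AAPL/earnings-calendar')

# The length control is assumed to wrap a <select>; choose 100 rows,
# then give the table a moment to redraw.
length = Select(driver.find_element(By.CSS_SELECTOR,
                                    '#earnings_announcements_earnings_table_length select'))
length.select_by_visible_text('100')
time.sleep(2)

tables = pd.read_html(driver.page_source)
print(tables[4])  # index 4 per the question; may differ
driver.quit()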
Web scraping for dummies (or not)
GOAL

Extract data from a web page... automatically. The data are on this page... Be careful, it's in French.

MY HARD WAY, manually

I choose the data I want by clicking the desired fields on the left side ('CHOISIR DES INDICATEURS'). Then I select 'Tableau' (= Table) to get the data as a table. Then I click 'Action' on the right side, then 'Exporter' (= Export). Finally I choose the format I want (e.g. CSV) and hit 'Exécuter' (= Execute) to download the file.

WHAT I TRIED

I tried to automate this process, but it feels like an impossible task for me. I inspected the page's network exchanges to see whether there is an underlying server I could send a simple JSON request to. I mainly work with Python and frameworks like BS4 or Scrapy. I have only a little data to extract, so I can easily do it manually; this question is purely for my own knowledge, to see whether it is possible to scrape a page like that. I would appreciate it if you could share your skills! Thank you.
It is possible. This Real Python tutorial walks through scraping a website with Beautiful Soup, using a worked example: https://realpython.com/beautiful-soup-web-scraper-python/#scraping-the-monster-job-site
How to web scrape data which only appears on mouse hover using python beautiful soup?
I am working on a project with this link: https://www.nasdaq.com/market-activity/stocks/aapl/earnings

I am able to extract all the table data easily with the normal Beautiful Soup approach. However, the site also has a graph, and the data I need only appears when you hover the cursor over it. My issue is simple: how do you extract that? When I inspect the page's source code at the bar chart, I only get its CSS, the lengths of the bars and so on, not the actual estimated and reported EPS figures that appear on mouse hover.

I did try:

try:
    divparent = soup.find_all('div', attrs={'class':'highcharts-point highcharts-color-0 highcharts-point-mouseOut'})
except:
    print("no table div")
    return

but to no avail, and I have literally no idea how to go about this. Any assistance would be greatly appreciated. Thank you.
This data is added to the page by JavaScript and is not in the response you get when requesting https://www.nasdaq.com/market-activity/stocks/aapl/earnings. However, you can get it from the API that the JavaScript code itself calls. Just send your GET request to: https://api.nasdaq.com/api/quote/AAPL/eps
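A minimal sketch of calling that endpoint; the browser-like User-Agent header is an assumption (the Nasdaq API tends to stall without one), and the JSON layout should be inspected rather than taken from here.

import requests

url = 'https://api.nasdaq.com/api/quote/AAPL/eps'
headers = {'User-Agent': 'Mozilla/5.0'}  # the API tends to hang without a browser-like UA
resp = requests.get(url, headers=headers, timeout=10)
resp.raise_for_status()
payload = resp.json()
print(payload['data'])  # top-level 'data' key assumed; inspect the response yourself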
Python- Downloading a file from a webpage by clicking on a link
I've looked around the internet for a solution to this, but none have really seemed applicable here. I'm writing a Python program to predict the next day's stock price using historical data. I don't need all the historical data since inception, as Yahoo Finance provides, only the last 60 days or so. The NASDAQ website provides just the right amount of historical data, so I wanted to use it.

What I want to do is go to a particular stock's profile on NASDAQ, for example www.nasdaq.com/symbol/amd/historical, and click the "Download this File in Excel Format" link at the very bottom. I inspected the page's HTML to see if there was an actual link I could just fetch with urllib, but all I got was:

<a id="lnkDownLoad" href="javascript:getQuotes(true);">
    Download this file in Excel Format
</a>

No link. So my question is: how can I write a Python script that goes to a given stock's NASDAQ page, clicks the "Download this file in Excel Format" link, and actually downloads the file? Most solutions online require you to know the URL where the file is stored, but in this case I don't have access to that. So how do I go about doing this?
Using Chrome, go to View > Developer > Developer Tools. In the developer tools UI, switch to the Network tab. Navigate to the place where you would need to click, and press the clear (🚫) button to clear all recent activity. Click the link and see whether any requests were made to the server. If there were, click one and see if you can reverse-engineer the API of its endpoint. Please be aware that this may be against the website's Terms of Service!
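If the Network tab does turn up a request, replaying it from Python usually looks something like the sketch below; the URL and headers are placeholders for whatever your own inspection reveals.

import requests

# Hypothetical endpoint discovered in the Network tab; substitute your own.
url = 'https://example.com/api/endpoint-you-found'
headers = {'User-Agent': 'Mozilla/5.0'}  # some sites reject the default python-requests UA
resp = requests.get(url, headers=headers, timeout=10)
resp.raise_for_status()
with open('download.xlsx', 'wb') as f:
    f.write(resp.content)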
It appears that BeautifulSoup might be the easiest way to do this. I've made a cursory check that the results of the following script are the same as those that appear on the page, although the columns are ordered differently. You would just have to write the results to a file rather than print them.

import requests
from bs4 import BeautifulSoup

URL = 'http://www.nasdaq.com/symbol/amd/historical'
page = requests.get(URL).text
soup = BeautifulSoup(page, 'lxml')

# The price history lives in a div with id "historicalContainer".
tableDiv = soup.find_all('div', id="historicalContainer")
tableRows = tableDiv[0].findAll('tr')

# Skip the two header rows, then emit each data row as quoted CSV.
for tableRow in tableRows[2:]:
    row = tuple(tableRow.getText().split())
    print('"%s",%s,%s,%s,%s,"%s"' % row)

Output:

"03/24/2017",14.16,14.18,13.54,13.7,"50,022,400"
"03/23/2017",13.96,14.115,13.77,13.79,"44,402,540"
"03/22/2017",13.7,14.145,13.55,14.1,"61,120,500"
"03/21/2017",14.4,14.49,13.78,13.82,"72,373,080"
"03/20/2017",13.68,14.5,13.54,14.4,"91,009,110"
"03/17/2017",13.62,13.74,13.36,13.49,"224,761,700"
"03/16/2017",13.79,13.88,13.65,13.65,"44,356,700"
"03/15/2017",14.03,14.06,13.62,13.98,"55,070,770"
"03/14/2017",14,14.15,13.6401,14.1,"52,355,490"
"03/13/2017",14.475,14.68,14.18,14.28,"72,917,550"
"03/10/2017",13.5,13.93,13.45,13.91,"62,426,240"
"03/09/2017",13.45,13.45,13.11,13.33,"45,122,590"
"03/08/2017",13.25,13.55,13.1,13.22,"71,231,410"
"03/07/2017",13.07,13.37,12.79,13.05,"76,518,390"
"03/06/2017",13,13.34,12.38,13.04,"117,044,000"
"03/03/2017",13.55,13.58,12.79,13.03,"163,489,100"
"03/02/2017",14.59,14.78,13.87,13.9,"103,970,100"
"03/01/2017",15.08,15.09,14.52,14.96,"73,311,380"
"02/28/2017",15.45,15.55,14.35,14.46,"141,638,700"
"02/27/2017",14.27,15.35,14.27,15.2,"95,126,330"
"02/24/2017",14,14.32,13.86,14.12,"46,130,900"
"02/23/2017",14.2,14.45,13.82,14.32,"79,900,450"
"02/22/2017",14.3,14.5,14.04,14.28,"71,394,390"
"02/21/2017",13.41,14.1,13.4,14,"66,250,920"
"02/17/2017",12.79,13.14,12.6,13.13,"40,831,730"
"02/16/2017",13.25,13.35,12.84,12.97,"52,403,840"
"02/15/2017",13.2,13.44,13.15,13.3,"33,655,580"
"02/14/2017",13.43,13.49,13.19,13.26,"40,436,710"
"02/13/2017",13.7,13.95,13.38,13.49,"57,231,080"
"02/10/2017",13.86,13.86,13.25,13.58,"54,522,240"
"02/09/2017",13.78,13.89,13.4,13.42,"72,826,820"
"02/08/2017",13.21,13.75,13.08,13.56,"75,894,880"
"02/07/2017",14.05,14.27,13.06,13.29,"158,507,200"
"02/06/2017",12.46,13.7,12.38,13.63,"139,921,700"
"02/03/2017",12.37,12.5,12.04,12.24,"59,981,710"
"02/02/2017",11.98,12.66,11.95,12.28,"116,246,800"
"02/01/2017",10.9,12.14,10.81,12.06,"165,784,500"
"01/31/2017",10.6,10.67,10.22,10.37,"51,993,490"
"01/30/2017",10.62,10.68,10.3,10.61,"37,648,430"
"01/27/2017",10.6,10.73,10.52,10.67,"32,563,480"
"01/26/2017",10.35,10.66,10.3,10.52,"35,779,140"
"01/25/2017",10.74,10.975,10.15,10.35,"61,800,440"
"01/24/2017",9.95,10.49,9.95,10.44,"43,858,900"
"01/23/2017",9.68,10.06,9.68,9.91,"27,848,180"
"01/20/2017",9.88,9.96,9.67,9.75,"27,936,610"
"01/19/2017",9.92,10.25,9.75,9.77,"46,087,250"
"01/18/2017",9.54,10.1,9.42,9.88,"51,705,580"
"01/17/2017",10.17,10.23,9.78,9.82,"70,388,000"
"01/13/2017",10.79,10.87,10.56,10.58,"38,344,340"
"01/12/2017",10.98,11.0376,10.33,10.76,"75,178,900"
"01/11/2017",11.39,11.41,11.15,11.2,"39,337,330"
"01/10/2017",11.55,11.63,11.33,11.44,"29,122,540"
"01/09/2017",11.37,11.64,11.31,11.49,"37,215,840"
"01/06/2017",11.29,11.49,11.11,11.32,"34,437,560"
"01/05/2017",11.43,11.69,11.23,11.24,"38,777,380"
"01/04/2017",11.45,11.5204,11.235,11.43,"40,742,680"
"01/03/2017",11.42,11.65,11.02,11.43,"55,114,820"
"12/30/2016",11.7,11.78,11.25,11.34,"44,033,460"
"12/29/2016",11.24,11.62,11.01,11.59,"50,180,310"
"12/28/2016",12.28,12.42,11.46,11.55,"71,072,640"
"12/27/2016",11.65,12.08,11.6,12.07,"44,168,130"

The script quotes the dates and the thousands-separated volume figures so each stays a single CSV field.
Dig a little deeper and find out what the js function getQuotes() does; you should get a good clue from that. If it all seems too complicated, you can always use Selenium, which simulates a real browser, although it is much slower than making native network calls. You can find the official documentation here.
Webscraping Financial Data from Morningstar
I am trying to scrape data from the Morningstar page below:

http://financials.morningstar.com/ratios/r.html?t=IBM&region=USA&culture=en_US

I am currently trying to do just IBM, but hope eventually to be able to type in the ticker of another company and do the same for that one. My code so far is below:

import requests, os, bs4, string

url = 'http://financials.morningstar.com/ratios/r.html?t=IBM&region=USA&culture=en_US'
fin_tbl = ()
page = requests.get(url)
c = page.content
soup = bs4.BeautifulSoup(c, "html.parser")
summary = soup.find("div", {"class": "r_bodywrap"})
tables = summary.find_all('table')
print(tables[0])

The problem I am experiencing is that, unlike on simpler pages I have scraped, the program can't locate any tables, even though I can see them in the HTML for the page.

The closest Stack Overflow question I found while researching this is: Python webscraping - NoneObeject Failure - broken HTML? There they explained that Morningstar's tables are dynamically loaded, and used some JSON code I am unfamiliar with to somehow generate a different web link which managed to scrape the data, but I don't understand where that link came from.
It's a real problem scraping some modern web pages, particularly pages generated by single-page applications, where the content is assembled by AJAX calls and DOM modification rather than delivered as ready-to-go HTML in a single server response. The best way I have found to access such content is to use the Selenium web-testing environment to have a browser load the page under the control of my program, then extract the page contents from Selenium for scraping. There are other environments that will execute the scripts and modify the DOM appropriately, but I haven't used any of them. It's not as difficult as it sounds, but it will take a little jiggering around to get there.
Web scraping can be greatly simplified when the site offers an API, be it officially supported or just an unofficial hack. Even the hack is better than trying to fiddle with HTML that can change every day. So a search for "morningstar api" might be fruitful. And, in fact, some friendly Gister has already worked this out for you. Were the search to come up empty, a usually fruitful approach is to investigate what AJAX calls the page makes to retrieve its data and then issue them directly. This can be done with the browser's debugger, in the "Network" tab or similar, where each request can be examined in detail in a very friendly UI.
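As an illustration of the direct-call approach, here is a sketch against the key-ratios CSV export documented in community gists; treat the endpoint and the skiprows value as assumptions to verify, since Morningstar may have changed or retired it.

import io
import pandas as pd
import requests

ticker = 'IBM'
# Endpoint as documented in community gists; verify it still responds.
url = 'http://financials.morningstar.com/ajax/exportKR2CSV.html?t={}'.format(ticker)
resp = requests.get(url, timeout=10)
resp.raise_for_status()

# The export begins with a couple of title lines before the header row.
df = pd.read_csv(io.StringIO(resp.text), skiprows=2)
print(df.head())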
I've found scraping dynamic sites to be a lot easier with JavaScript than with Python + Selenium. There is a great module for Node.js/PhantomJS, ScraperJS. It is very easy to use: it injects jQuery into the scraped page, and you can extract data with jQuery selectors.