Python - Downloading a file from a webpage by clicking on a link
I've looked around the internet for a solution to this, but none of the answers I found seemed applicable here. I'm writing a Python program to predict the next day's stock price using historical data. I don't need the full price history since inception that Yahoo Finance provides, only the last 60 days or so. The NASDAQ website provides just the right amount of historical data, so I wanted to use that site.
What I want to do is go to a particular stock's profile on NASDAQ, for example www.nasdaq.com/symbol/amd/historical, and click on the "Download this File in Excel Format" link at the very bottom. I inspected the page's HTML to see if there was an actual link I could just fetch with urllib, but all I got was:
<a id="lnkDownLoad" href="javascript:getQuotes(true);">
Download this file in Excel Format
</a>
No link. So my question is: how can I write a Python script that goes to a given stock's NASDAQ page, clicks the "Download this file in Excel Format" link, and actually downloads the file? Most solutions online require you to know the URL where the file is stored, but in this case I don't have access to that. So how do I go about doing this?
1. Using Chrome, go to View > Developer > Developer Tools.
2. In the developer tools UI, switch to the Network tab.
3. Navigate to the place where you would need to click, and press the ⃠ symbol to clear all recent activity.
4. Click the link and see whether any request was made to the server.
5. If there was, click the request and see if you can reverse-engineer its endpoint's API (see the sketch below).
Please be aware that this may be against the website's Terms of Service!
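If a request does show up, you can often replay it directly with the requests library instead of driving a browser. Everything below (the endpoint, the parameters, the output filename) is hypothetical, standing in for whatever you actually observe in the Network tab:

import requests

# Hypothetical endpoint discovered in the Network tab; substitute the real
# URL and parameters you observe for your target page.
ENDPOINT = 'https://example.com/api/historical'
params = {'symbol': 'AMD', 'days': 60}

response = requests.get(ENDPOINT, params=params, timeout=10)
response.raise_for_status()

# Save whatever the server returns (CSV, Excel, JSON, ...) to disk.
with open('amd_historical.xlsx', 'wb') as f:
    f.write(response.content)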
It appears that BeautifulSoup might be the easiest way to do this. I've made a cursory check that the results of the following script are the same as those that appear on the page; note, however, that the columns are ordered differently. You would just have to write the results to a file rather than print them.
import requests
from bs4 import BeautifulSoup

URL = 'http://www.nasdaq.com/symbol/amd/historical'
page = requests.get(URL).text
soup = BeautifulSoup(page, 'lxml')

# The historical quotes table sits inside a div with id="historicalContainer".
tableDiv = soup.find_all('div', id='historicalContainer')
tableRows = tableDiv[0].find_all('tr')

# Skip the two header rows; each remaining row splits into
# date, open, high, low, close and volume.
for tableRow in tableRows[2:]:
    row = tuple(tableRow.getText().split())
    print('"%s",%s,%s,%s,%s,"%s"' % row)
Output:
"03/24/2017",14.16,14.18,13.54,13.7,"50,022,400"
"03/23/2017",13.96,14.115,13.77,13.79,"44,402,540"
"03/22/2017",13.7,14.145,13.55,14.1,"61,120,500"
"03/21/2017",14.4,14.49,13.78,13.82,"72,373,080"
"03/20/2017",13.68,14.5,13.54,14.4,"91,009,110"
"03/17/2017",13.62,13.74,13.36,13.49,"224,761,700"
"03/16/2017",13.79,13.88,13.65,13.65,"44,356,700"
"03/15/2017",14.03,14.06,13.62,13.98,"55,070,770"
"03/14/2017",14,14.15,13.6401,14.1,"52,355,490"
"03/13/2017",14.475,14.68,14.18,14.28,"72,917,550"
"03/10/2017",13.5,13.93,13.45,13.91,"62,426,240"
"03/09/2017",13.45,13.45,13.11,13.33,"45,122,590"
"03/08/2017",13.25,13.55,13.1,13.22,"71,231,410"
"03/07/2017",13.07,13.37,12.79,13.05,"76,518,390"
"03/06/2017",13,13.34,12.38,13.04,"117,044,000"
"03/03/2017",13.55,13.58,12.79,13.03,"163,489,100"
"03/02/2017",14.59,14.78,13.87,13.9,"103,970,100"
"03/01/2017",15.08,15.09,14.52,14.96,"73,311,380"
"02/28/2017",15.45,15.55,14.35,14.46,"141,638,700"
"02/27/2017",14.27,15.35,14.27,15.2,"95,126,330"
"02/24/2017",14,14.32,13.86,14.12,"46,130,900"
"02/23/2017",14.2,14.45,13.82,14.32,"79,900,450"
"02/22/2017",14.3,14.5,14.04,14.28,"71,394,390"
"02/21/2017",13.41,14.1,13.4,14,"66,250,920"
"02/17/2017",12.79,13.14,12.6,13.13,"40,831,730"
"02/16/2017",13.25,13.35,12.84,12.97,"52,403,840"
"02/15/2017",13.2,13.44,13.15,13.3,"33,655,580"
"02/14/2017",13.43,13.49,13.19,13.26,"40,436,710"
"02/13/2017",13.7,13.95,13.38,13.49,"57,231,080"
"02/10/2017",13.86,13.86,13.25,13.58,"54,522,240"
"02/09/2017",13.78,13.89,13.4,13.42,"72,826,820"
"02/08/2017",13.21,13.75,13.08,13.56,"75,894,880"
"02/07/2017",14.05,14.27,13.06,13.29,"158,507,200"
"02/06/2017",12.46,13.7,12.38,13.63,"139,921,700"
"02/03/2017",12.37,12.5,12.04,12.24,"59,981,710"
"02/02/2017",11.98,12.66,11.95,12.28,"116,246,800"
"02/01/2017",10.9,12.14,10.81,12.06,"165,784,500"
"01/31/2017",10.6,10.67,10.22,10.37,"51,993,490"
"01/30/2017",10.62,10.68,10.3,10.61,"37,648,430"
"01/27/2017",10.6,10.73,10.52,10.67,"32,563,480"
"01/26/2017",10.35,10.66,10.3,10.52,"35,779,140"
"01/25/2017",10.74,10.975,10.15,10.35,"61,800,440"
"01/24/2017",9.95,10.49,9.95,10.44,"43,858,900"
"01/23/2017",9.68,10.06,9.68,9.91,"27,848,180"
"01/20/2017",9.88,9.96,9.67,9.75,"27,936,610"
"01/19/2017",9.92,10.25,9.75,9.77,"46,087,250"
"01/18/2017",9.54,10.1,9.42,9.88,"51,705,580"
"01/17/2017",10.17,10.23,9.78,9.82,"70,388,000"
"01/13/2017",10.79,10.87,10.56,10.58,"38,344,340"
"01/12/2017",10.98,11.0376,10.33,10.76,"75,178,900"
"01/11/2017",11.39,11.41,11.15,11.2,"39,337,330"
"01/10/2017",11.55,11.63,11.33,11.44,"29,122,540"
"01/09/2017",11.37,11.64,11.31,11.49,"37,215,840"
"01/06/2017",11.29,11.49,11.11,11.32,"34,437,560"
"01/05/2017",11.43,11.69,11.23,11.24,"38,777,380"
"01/04/2017",11.45,11.5204,11.235,11.43,"40,742,680"
"01/03/2017",11.42,11.65,11.02,11.43,"55,114,820"
"12/30/2016",11.7,11.78,11.25,11.34,"44,033,460"
"12/29/2016",11.24,11.62,11.01,11.59,"50,180,310"
"12/28/2016",12.28,12.42,11.46,11.55,"71,072,640"
"12/27/2016",11.65,12.08,11.6,12.07,"44,168,130"
The script quotes the dates and the thousands-separated volume figures so that each survives as a single CSV field.
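To write the rows to a file instead of printing them, Python's csv module handles the quoting automatically. A minimal sketch reusing the tableRows variable from the script above; the header names are an assumption about the column order, so check them against the page:

import csv

with open('amd_historical.csv', 'w', newline='') as f:
    writer = csv.writer(f)  # commas inside the volume field get quoted for us
    writer.writerow(['Date', 'Open', 'High', 'Low', 'Close', 'Volume'])  # assumed order
    for tableRow in tableRows[2:]:
        writer.writerow(tableRow.getText().split())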
Dig a little deeper and find out what the JavaScript function getQuotes() does. That should give you a good clue.
If it all seems too complicated, you can always use Selenium, which automates a real browser, although it is much slower than making native network calls. You can find the official documentation here.
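As a rough illustration of the Selenium route, the browser can load the page and click the link by its id, letting the page's own getQuotes() JavaScript trigger the download. This is an untested sketch; the download-directory preference is a standard Chrome option, but the element id is the one from the question and the timing is a guess:

import time
from selenium import webdriver

options = webdriver.ChromeOptions()
# Ask Chrome to save downloads to a known folder without prompting.
options.add_experimental_option('prefs', {'download.default_directory': '/tmp'})

driver = webdriver.Chrome(options=options)
driver.get('http://www.nasdaq.com/symbol/amd/historical')

# The anchor found in the question's HTML snippet.
driver.find_element_by_id('lnkDownLoad').click()

time.sleep(10)  # give the download time to finish before closing the browser
driver.quit()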
Related
Python - How to use scrape table from website with dropdown of available rows
I am trying to scrape the earnings calendar data from the table on zacks.com; the URL is below. https://www.zacks.com/stock/research/aapl/earnings-calendar

The thing is, I am trying to scrape all the data from the table, but it has a dropdown list to select 10, 25, 50 or 100 rows on a page. Ideally I want to scrape all 100 rows, but when I select 100 from the dropdown list the URL doesn't change. My code is below. Note that the website blocks my user agent, so I had to use chromedriver to impersonate a human visiting the site. pd.read_html returns a list of all the tables, and d[4] is the earnings calendar with only 10 rows (which I want to change to 100).

driver = webdriver.Chrome('../files/chromedriver96')
symbol = 'AAPL'
url = 'https://www.zacks.com/stock/research/{}/earnings-calendar'.format(symbol)
driver.get(url)
content = driver.page_source
d = pd.read_html(content)
d[4]

So I'm calling for help from anyone who can guide me on this. Thanks!

UPDATE: It looks like my last post was downvoted due to a lack of clear articulation and evidence of past research. Maybe I am still a newbie to posting questions on this site. I have actually found several pages, including this one, with the same issue, but the solutions didn't seem to work for me, which is why I posted this as a new question.

UPDATE 12/05: Thanks a lot for the advice. As commented below, I finally got it working. Below is the code I used:

dropdown = driver.find_element_by_css_selector('#earnings_announcements_earnings_table_length')
time.sleep(1)
hundreds = dropdown.find_element_by_xpath(".//option[. = '100']")
hundreds.click()
Having taken a look, this is not going to be easy to scrape. Given that the table is produced by JavaScript, I would say you have two options.

Option one: use Selenium to render the page, allowing the JavaScript to run. This way you can simply use the id/class of the dropdown to interact with it (see the sketch below), and then scrape the data by reading the values in the table.

Option two: this is the more challenging one. Look through the requests the page makes and try to find the ones whose responses contain the data you then see on the page. By cross-referencing these, there will be a way to request the data directly. You may find that you need to accept a key from the original page request and then send that key as part of a second request. This approach lets you scrape the data without running a Selenium instance, so it is more efficient.

My personal suggestion is to go with option one, as computer resources are cheap and developer time is expensive.
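For option one, Selenium's Select helper is the idiomatic way to drive a <select> dropdown. A sketch along the lines of the asker's eventual solution; the wrapper id comes from the question, while the nested select location, the sleep and the table index are assumptions:

import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.support.ui import Select

driver = webdriver.Chrome('../files/chromedriver96')
driver.get('https://www.zacks.com/stock/research/AAPL/earnings-calendar')

# The wrapper id comes from the question; the <select> inside it is assumed.
dropdown = Select(driver.find_element_by_css_selector(
    '#earnings_announcements_earnings_table_length select'))
dropdown.select_by_visible_text('100')

time.sleep(2)  # let the table redraw with 100 rows
tables = pd.read_html(driver.page_source)
earnings = tables[4]  # index taken from the question
driver.quit()
print(earnings.head())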
Python - Beautiful Soup to grab emails from website
I've been trying to figure out a simple way to run through a set of URLs that lead to pages that all have the same layout. One issue we figured out is that in the original list the URLs are http, but they then redirect to https. I am not sure whether that causes a problem when trying to pull information from the page.

I can see the structure of the page when I use the Inspector in Chrome, but when I try to set up the code to grab the relevant links I come up empty (literally). The most general code I have been using is:

soup = BeautifulSoup(urllib2.urlopen('https://ngcproject.org/program/algirls').read())
links = SoupStrainer('a')
print links

which yields:

a|{}

Given that I'm new to this, I've been trying anything I think might work. I also tried:

mail = soup.find(attrs={'class':'tc-connect-details_send-email'}).a['href']

and

spans = soup.find_all('span', {'class' : 'tc-connect-details_send-email'})
lines = [span.get_text() for span in spans]
print lines

but these don't yield anything either. I am assuming it's an issue with my code and not that the data are hidden from scraping. Ideally I want the data written to a CSV file for each URL I scrape, but right now I need to confirm that the code is actually grabbing the right information. Any suggestions welcome!
If you press CTRL+U in Google Chrome, or right-click > View Source, you'll see that the page is rendered using JavaScript. urllib is not going to be able to download what you're looking for. You'll have to use an automated browser (Selenium is the most popular), which you can drive with Google Chrome/Firefox or a headless browser (PhantomJS). You can then pull the information out through Selenium, store it, and manipulate it any way you see fit.
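A sketch of that approach for the email links; the class name is the one from the question, and everything else (driver choice, parser) is an assumption, untested against the live page:

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://ngcproject.org/program/algirls')

# Hand the JavaScript-rendered DOM to BeautifulSoup.
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()

# Class name taken from the question; adjust to whatever the rendered DOM shows.
for a in soup.select('a.tc-connect-details_send-email'):
    print(a.get('href'))  # typically a mailto: address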
Webscraping Financial Data from Morningstar
I am trying to scrape data from the Morningstar website below: http://financials.morningstar.com/ratios/r.html?t=IBM&region=USA&culture=en_US

I am currently trying to do just IBM, but I hope to eventually be able to type in the ticker of another company and do the same for it. My code so far is below:

import requests, os, bs4, string

url = 'http://financials.morningstar.com/ratios/r.html?t=IBM&region=USA&culture=en_US'
fin_tbl = ()
page = requests.get(url)
c = page.content
soup = bs4.BeautifulSoup(c, "html.parser")
summary = soup.find("div", {"class":"r_bodywrap"})
tables = summary.find_all('table')
print(tables[0])

The problem I am experiencing at the moment is that, unlike on a simpler webpage I have scraped, the program can't seem to locate any tables, even though I can see them in the HTML for the page.

In researching this problem, the closest Stack Overflow question is: Python webscraping - NoneObeject Failure - broken HTML?

In that one they explained that Morningstar's tables are dynamically loaded, and they used some JSON code I am unfamiliar with that somehow generated a different web link which managed to scrape the data, but I don't understand where it came from.
It's a real problem scraping many modern web pages, particularly pages generated by single-page applications, where the content is built up by AJAX calls and DOM modification rather than delivered as ready-to-go HTML in a single server response. The best way I have found to access such content is to use the Selenium web testing environment to have a browser load the page under the control of my program, then extract the page contents from Selenium for scraping. There are other environments that will execute the scripts and modify the DOM appropriately, but I haven't used any of them. It's not as difficult as it sounds, but it will take you a little jiggering around to get there.
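A minimal sketch of that approach for the Morningstar page, adding an explicit wait so the AJAX-built table has a chance to appear before scraping. The r_bodywrap selector comes from the question; the 15-second timeout is an arbitrary assumption:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get('http://financials.morningstar.com/ratios/r.html?t=IBM&region=USA&culture=en_US')

# Block until the dynamically loaded ratios table actually exists in the DOM.
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'div.r_bodywrap table')))

soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

summary = soup.find('div', {'class': 'r_bodywrap'})
print(summary.find_all('table')[0])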
Web scraping can be greatly simplified when the site offers an API, be it officially supported or just an unofficial hack. Even the hack is better than trying to fiddle with HTML that can change every day. So a search for "morningstar api" might be fruitful. And, in fact, some friendly Gister has already worked this out for you.

Should the search come up empty, a usually fruitful approach is to investigate what AJAX calls the page makes to retrieve its data and then issue them directly. This can be done via the browser's debugger, in the "Network" tab or similar, where each request can be inspected in detail in a very friendly UI.
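As a generic sketch of that second approach: once the Network tab reveals the request that carries the table data, replay it with requests. The endpoint below is a placeholder, not a real Morningstar URL; only the query parameters echo the page's own URL:

import requests

# Placeholder endpoint: substitute the AJAX URL observed in the Network tab.
AJAX_URL = 'http://financials.morningstar.com/some/ajax/endpoint'
params = {'t': 'IBM', 'region': 'USA', 'culture': 'en_US'}

resp = requests.get(AJAX_URL, params=params, timeout=10)
resp.raise_for_status()

# Depending on the endpoint, this may be JSON, CSV or an HTML fragment.
print(resp.text[:500])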
I've found scraping dynamic sites to be a lot easier with JavaScript than with Python + Selenium. There is a great module for nodejs/phantomjs: ScraperJS. It is very easy to use: it injects jQuery into the scraped page and you can extract data with jQuery selectors.
getting specific images from page
I am pretty new to BeautifulSoup. I am trying to print image links from http://www.bing.com/images?q=owl:

redditFile = urllib2.urlopen("http://www.bing.com/images?q=owl")
redditHtml = redditFile.read()
redditFile.close()
soup = BeautifulSoup(redditHtml)
productDivs = soup.findAll('div', attrs={'class' : 'dg_u'})
for div in productDivs:
    print div.find('a')['t1']     # works fine
    print div.find('img')['src']  # KeyError: 'src'

But this gives only the title, not the image source. Is there anything wrong?

Edit: I have edited my source and still could not get the image URL.
Bing uses some techniques to block automated scrapers. I tried printing div.find('img') and found that they send the source in an attribute named src2, so the following should work:

div.find('img')['src2']

This is working for me. Hope it helps.
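To make that a bit more robust when either attribute may be present, .get() avoids the KeyError; a small tweak to the question's loop, under the same assumptions:

for div in productDivs:
    img = div.find('img')
    if img is None:
        continue
    # Prefer src2 (what Bing sends here), fall back to a plain src.
    print img.get('src2') or img.get('src')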
If you open up the browser developer tools, you'll see that there is an additional async XHR request issued to the http://www.bing.com/images/async endpoint which contains the image search results. That leads to the three main options you have:

1. Simulate that XHR request in your code. You might want to use something more suitable for humans than urllib2; see the requests module. This is the so-called "low-level" approach, going down to the bare metal of the site-specific implementation, which makes it unreliable, difficult, "heavy", error-prone and fragile (a sketch follows after this list).
2. Automate a real browser using selenium, staying at the high level. In other words, you don't care how the results are retrieved, what requests are made, or what JavaScript needs to be executed; you just wait for the search results to appear and extract them.
3. Use the Bing Search API (this should probably be option #1).
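A sketch of the first option; the endpoint is the one named above, but the query parameters and headers are assumptions copied from watching the browser, and Bing may change or block them at any time:

import requests
from bs4 import BeautifulSoup

# Parameters are illustrative; copy the real ones from the Network tab.
resp = requests.get(
    'http://www.bing.com/images/async',
    params={'q': 'owl', 'first': 0, 'count': 35},
    headers={'User-Agent': 'Mozilla/5.0'},
    timeout=10,
)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, 'lxml')
for img in soup.find_all('img'):
    src = img.get('src2') or img.get('src')
    if src:
        print(src)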
Querying web pages with Python
I am learning web programming with Python, and one of the exercises I am working on is the following: I am writing a Python program to query the website orbitz.com and return the lowest airfare. The departure and arrival cities and dates are used to construct the URL. I am doing this using the urlopen command, as follows (search_str contains the URL):

from lxml.html import parse
from urllib2 import urlopen

parsed = parse(urlopen(search_str))
doc = parsed.getroot()
links = doc.findall('.//a')
the_link = (links[j].text_content()).strip()

The idea is to retrieve all the links from the query results and search for strings such as "Delta", "United", etc., and read off the dollar amount next to the links.

It worked successfully until today; it looks like orbitz.com has changed their results page. Now, when you enter the travel details on the orbitz.com website, a page appears showing a spinning wheel saying "looking up itineraries" or something to that effect. This is just a filler page and contains no real information. After a few seconds the real results page is displayed. Unfortunately, the Python code returns the links for the filler page each time, and I never obtain the real results. How can I get around this? I am a relative beginner to web programming, so any help is greatly appreciated.
This kind of thing is normal in the world of crawlers. What you need to do is figure out what URL the filler page redirects to after the "itinerary" step, and hit that URL directly from your script. Then figure out whether they have changed the final search results page too; if so, modify your script to accommodate those changes.
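One common pattern is that a filler page names its destination in a meta refresh tag. A sketch of fishing that URL out and following it, reusing the question's imports; whether Orbitz actually uses a meta refresh (rather than a JavaScript redirect) is an assumption, so inspect the filler page's source first:

from lxml.html import parse
from urllib2 import urlopen

# search_str is the constructed query URL from the question.
doc = parse(urlopen(search_str)).getroot()

# Assumption: the filler page declares its target in <meta http-equiv="refresh">.
content = doc.xpath('//meta[@http-equiv="refresh"]/@content')
if content:
    # content looks like "5; url=http://...": take everything after "url=".
    results_url = content[0].split('url=')[-1].strip()
    results = parse(urlopen(results_url)).getroot()
    links = results.findall('.//a')
else:
    print 'No meta refresh found; check the filler page for a JavaScript redirect.'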