Identifying DJIA data using Beautiful Soup

Identifying DJIA data using Beautiful Soup - python

I'm fairly new to coding and am trying to write a script that would pull market data at timed intervals while running, then compare the delta between each pull and notify the user of the change - looking for simple shifts, let's say >.1% in any interval.
My initial approach is to run a Beautiful Soup script to obtain posted market data, using either Yahoo Finance or Barron's, as both seem to have the data available in the HTML code:
https://finance.yahoo.com/calendar
http://www.barrons.com/mdc/public/page/9_3000.html?mod=bol_mdc_topnav_9_3000
This is as far as I've gotten and not having much luck, the find function doesn't seem to be returning anything from the site - looking for any nudging that might help me get on the right track with this
from bs4 import BeautifulSoup
from urllib.request import urlopen
import requests
import pandas as pd
url = 'https://finance.yahoo.com/calendar'
page = urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
soup.find("span")
I would expect this to return the first span tag so I could later hone in on the DJIA data: "
span class="Trsdu(0.3s) Fz(s) Mt(4px) Mb(0px) Fw(b) D(ib)" data-reactid="31">26,430.14</span
but the script runs and returns nothing

You can use the same url the bottom of your listed urls is using to source the quote
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://quotes.wsj.com/index/DJIA?mod=mdc_uss_dtabnk')
soup = bs(r.content, 'lxml')
djia = soup.select_one('#quote_val').text
print(djia)
That is clear as source when you inspect the network traffic of the original bottom url you list and then focus on this url
http://www.barrons.com/mdc/public/js/9_3001_Refresh.js?
which has the js for refreshing that value. There you can see the listed source url for quote.
The response which contains:

Related

BeautifulSoup is missing content

I'm trying to scrap a number of visitors to my local climbing centre.
import requests
from bs4 import BeautifulSoup
page = requests.get("https://portal.rockgympro.com/portal/public/c3b9019203e4bc4404983507dbdf2359/occupancy?&iframeid=occupancyCounter&fId=1644")
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find('span', id="count")
print(results)
It's printing this:
<span id="count" style="display:inline"></span>
That's nice, but the number 19 is missing... What am I doing wrong?

It's there in json format in the tag of the html. Just need to pull it out.
import requests
import json
from bs4 import BeautifulSoup
url = 'https://portal.rockgympro.com/portal/public/c3b9019203e4bc4404983507dbdf2359/occupancy?&iframeid=occupancyCounter&fId=1644'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
scriptStr = str(soup.find_all('script')[2]).split('var data = ')[-1].split(';')[0].replace("'",'"')
last_char_index = scriptStr.rfind(",")
scriptStr = scriptStr[:last_char_index] + '}'
scriptStr = scriptStr.replace('&nbsp', ' ')
jsonData = json.loads(scriptStr)
count = jsonData['REA']['count']
capacity = jsonData['REA']['capacity']
lastUpdate = jsonData['REA']['lastUpdate']
print(f'{count} of {capacity} Climbers\n{lastUpdate}')
Output:
58 of 220 Climbers
Last updated: now (5:20 PM)

You're not doing anything wrong, the issue is that the website is populating the <span> element using JavaScript, which runs after your request is made.
Unfortunately, the requests library cannot run JavaScript since it is a pure HTTP tool. I would recommend checking out something like Selenium which is more robust and can wait for the JavaScript to load before scraping the HTML.

You can try requests_html module to get dynamic values which are calculated by javascript. I tried with below logic it worked for me on your site.
from bs4 import BeautifulSoup
import time
from requests_html import HTMLSession
url="Your Site Link"
# create an HTML Session object
session = HTMLSession()
# Use the object above to connect to needed webpage
resp = session.get(url)
# Run JavaScript code on webpage
resp.html.render(sleep=10)
soup = BeautifulSoup(resp.html.html, 'lxml')
results = soup.find('span', id="count")
print(results)
Your Site calculate Result

In the dev tools under one of the tags, you can see that many of those figures are generated after the page load by the JavaScript function showGym(). In order to allow those figures to generate you could use a browser driver tool like webbot or Selenium which can wait on pages long enough for the javascript to execute populate those fields. It might be possible to have requests do that, but I don't know as I've only used webbot when reaching problems like these as it's very easy to use.

Browser code and beautifulsoup collection different

I try to parse soccerstand front page soccer matches and fail because the items I get with BeautifulSoup are really different from what I see in browser.
My code is simple at the moment:
import urllib.request
from bs4 import BeautifulSoup
with urllib.request.urlopen('https://soccerstand.com/') as response:
url_data = response.read()
soup = BeautifulSoup(url_data, 'html.parser')
print(soup.find_all('div.event__match'))
So I tried this and this failed. When I checked soup variable it turned out not to contain such divs at all, so what I get with BS is different from what I see by inspecting code on the website.
What's the reason for that? Is there any workaround?

Python Web Scraping Not Getting the Desire Contents

Web Scraping Noob here. I start my journey to stock investing, so I am trying to get data from Morningstar. However, I am not getting the data I want.
Take https://www.morningstar.com/stocks/xnas/aapl/financials as an example.
These will be the numbers I am interested.
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = "https://www.morningstar.com/stocks/xnas/aapl/financials"
page = requests.get(url)
print(page.status_code)
soup = BeautifulSoup(page.content, 'html.parser')
The status_code is 200, which is fine. However, the parsed page contents did not contain the numbers I want. For example, no result is returned for soup.find_all('Price/Book').
I inspected the web page. The content stops at tag where I marked red.
soup.find("div", class_="mdc-column mds-layout-grid__col stock__content-sal mds-layout-grid__col--12 mds-layout-grid__col--auto-at-1092").contents
The code above returns an empty array. It should contain something () but does not. I don't think it has anything to do with js, but I don't know why beautiful soup fail to capture it either. Any hints are appreciated.
Hope I made myself clear enough.

Python BeautifulSoup trouble extracting titles from a page with JS

I'm having some serious issues trying to extract the titles from a webpage. I've done this before on some other sites but this one seems to be an issue because of the Javascript.
The test link is "https://www.thomasnet.com/products/adhesives-393009-1.html"
The first title I want extracted is "Toagosei America, Inc."
Here is my code:
import requests
from bs4 import BeautifulSoup
url = ("https://www.thomasnet.com/products/adhesives-393009-1.html")
r = requests.get(url).content
soup = BeautifulSoup(r, "html.parser")
print(soup.get_text())
Now if I run it like this, with get_text, i can find the titles in the result, however as soon as I change it to find_all or find, the titles are lost. I cant find them using web browser's inspect tool, because its all JS generated.
Any advice would be greatly appreciated.

You have to specify what to find, in this case <h2> to get first title:
import requests
from bs4 import BeautifulSoup
url = 'https://www.thomasnet.com/products/adhesives-393009-1.html'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
first_title = soup.find('h2')
print(first_title.text)
Prints:
Toagosei America, Inc.

Python scraping website with flight tickets

I am trying to extract information about prices of flight tickets with a python script. Please take a look at the picture:
I would like to parse all the prices (such as "121" at the bottom of the tree). I have constructed a simple script and my problem is that I am not sure how to get the right parts from the code behind page's "inspect element". My code is below:
import urllib3
from bs4 import BeautifulSoup as BS
http = urllib3.PoolManager()
ULR = "https://greatescape.co/?datesType=oneway&dateRangeType=exact&departDate=2019-08-19&origin=EAP&originType=city&continent=europe&flightType=3&city=WAW"
response = http.request('GET', URL)
soup = BS(response.data, "html.parser")
body = soup.find('body')
__next = body.find('div', {'id':'__next'})
ui_container = __next.find('div', {'class':'ui-container'})
bottom_container_root = ui_container.find('div', {'class':'bottom-container-root'})
print(bottom_container_root)
The problem is that I am stuck at the level of ui-container. bottom-container-root is an empty variable, despite it is a direct child under ui-container. Could someone please let me know how to parse this tree properly?
I have no experience in web scraping, but as it happens it is one step in a bigger workflow I am building.

.find_next_siblings and .next_element can be useful in navigating through containers.
Here is some example usage below.
from bs4 import BeautifulSoup
html = open("small.html").read()
soup = BeautifulSoup(html)
print soup.head.next_element
print soup.head.next_element.next_element

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Identifying DJIA data using Beautiful Soup - python

Related

BeautifulSoup is missing content

Browser code and beautifulsoup collection different

Python Web Scraping Not Getting the Desire Contents

Python BeautifulSoup trouble extracting titles from a page with JS

Python scraping website with flight tickets

Categories

Resources