Python, BeautifulSoup - Extracting part of a string - python

I'm working on a side project to see if I can predict the wins on a website, however this is one of the first times I've used BeautifulSoup and I'm not entirely sure on how to cut a string down to size.
Here is the code, I basically want to grab the information that holds where it crashed.
from bs4 import BeautifulSoup
from urllib import urlopen
html = urlopen('https://www.csgocrash.com/game/1/1287324').read()
soup = BeautifulSoup(html)
for section in soup.findAll('div',{"class":"row panel radius"}):
crashPoint = section.findChildren()[2]
print crashPoint
Upon running it, I get this as the output.
<p> <b>Crashed At: </b> 1.47x </p>
I want to basically only want to grab the number value, which would require me to cut from both sides, I just don't know how to go about doing this, as well as removing the HTML tags.

Find the Crashed At label by text and get the next sibling:
soup = BeautifulSoup(html, "html.parser")
for section in soup.findAll('div', {"class":"row panel radius"}):
crashPoint = section.find("b", text="Crashed At: ").next_sibling.strip()
print(crashPoint) # prints 1.47x
Also, not sure if you need a loop in this case since there is a single Crashed At value:
from bs4 import BeautifulSoup
from urllib import urlopen
html = urlopen('https://www.csgocrash.com/game/1/1287324').read()
soup = BeautifulSoup(html, "html.parser")
section = soup.find('div', {"class":"row panel radius"})
crashPoint = section.find("b", text="Crashed At: ").next_sibling.strip()
print(crashPoint)

Related

Python - Extracting info from website using BeautifulSoup

I am new to BeautifulSoup, and I'm trying to extract data from the following website.
https://excise.wb.gov.in/CHMS/Public/Page/CHMS_Public_Hospital_Bed_Availability.aspx
I am trying to extract the availability of the hospital beds information (along with the detailed breakup) after choosing a particular district and also with the 'With available bed only' option selected.
Should I choose the table, the td, the tbody, or the div class for this instance?
My current code:
from bs4 import BeautifulSoup
import requests
html_text = requests.get('https://excise.wb.gov.in/CHMS/Public/Page/CHMS_Public_Hospital_Bed_Availability.aspx').text
soup = BeautifulSoup(html_text, 'lxml')
locations= soup.find('div', {'class': 'col-lg-12 col-md-12 col-sm-12'})
print(locations)
This only prints out a blank output:
Output
I have also tried using tbody and from table still could not work it out.
Any help would be greatly appreciated!
EDIT: Trying to find a certain element returns []. The code -
from bs4 import BeautifulSoup
import requests
html_text = requests.get('https://excise.wb.gov.in/CHMS/Public/Page/CHMS_Public_Hospital_Bed_Availability.aspx').text
soup = BeautifulSoup(html_text, 'lxml')
location = soup.find_all('h5')
print(location)
It is probably a dynamic website, it means that when you use bs4 for retrieving data it doesn't retrieve what you see because the page updates or loads the content after the initial HTML load.
For these dynamic webpages you should use selenium and combine it with bs4.
https://selenium-python.readthedocs.io/index.html

How can i make BeautifulSoup finde more than one tag?

I iam testing a programm for printing the content of wikipedia into the prompt. I alrready got some output but its a bit messy. So i thought to only get the content of tags <p> and <b> that are the two which wikipedia uses to show the content. Here is my code:
import urllib.request
from bs4 import BeautifulSoup
URL = input("Enter the url (only wikipedia supported, default url https://?.wikipedia.org/wiki) : ")
page = urllib.request.urlopen(URL)
html_doc = page.read()
soup = BeautifulSoup(html_doc, 'html.parser')
for x in soup.find_all('p').find_all('b'):
print(x.string)
The interrogation mark is because wikipedia shows the lenguage there so it depends. As you see i added one more .find_all with because i didn´t know how to add it. Sorry for my bad english and my bad code because i am not very related to this request field. Thanks
BeautifulSoup.find_all returns a ResultSet which is essentially a list of elements. You need to iterate through that list for the second operation yourself.
import urllib.request
from bs4 import BeautifulSoup
URL = input("Enter the url (only wikipedia supported, default url https://?.wikipedia.org/wiki) : ")
page = urllib.request.urlopen(URL)
html_doc = page.read()
soup = BeautifulSoup(html_doc, 'html.parser')
for elem in soup.find_all('p'):
for x in elem.find_all('b'):
print(x.string)

BeautifulSoap does not work on subsequent pages

I cannot get the title of subsequent pages. Where is the problem?
from bs4 import BeautifulSoup
import urllib.request
# First page
source = urllib.request.urlopen('https://yeniemlak.az/elan/axtar?emlak=1&elan_nov=1&seher=0&metro=0&qiymet=&qiymet2=&mertebe=&mertebe2=&otaq=&otaq2=&sahe_m=&sahe_m2=&sahe_s=&sahe_s2=').read()
soup = BeautifulSoup(source,'lxml')
print(soup.title) # shows title as expected
# Second page
source = urllib.request.urlopen('https://yeniemlak.az/elan/axtar?emlak=1&elan_nov=1&seher=0&metro=0&qiymet=&qiymet2=&mertebe=&mertebe2=&otaq=&otaq2=&sahe_m=&sahe_m2=&sahe_s=&sahe_s2=&page=2').read()
soup = BeautifulSoup(source,'lxml')
print(soup.title) # shows None
Unsure why only your second case is failing. As mentioned in some other SO thread, sometimes using other parsers might work.
I could get the second page to work fine with html.parser. Though it threw a warning about decoding errors.
from bs4 import BeautifulSoup
import urllib.request
# Second page
source = urllib.request.urlopen('https://yeniemlak.az/elan/axtar?emlak=1&elan_nov=1&seher=0&metro=0&qiymet=&qiymet2=&mertebe=&mertebe2=&otaq=&otaq2=&sahe_m=&sahe_m2=&sahe_s=&sahe_s2=&page=2').read()
soup = BeautifulSoup(source,'html.parser')
print(soup.title) # Now works
Output
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
<title>YENIEMLAK.AZ Satılır Bina ev menzil </title>

Identifying DJIA data using Beautiful Soup

I'm fairly new to coding and am trying to write a script that would pull market data at timed intervals while running, then compare the delta between each pull and notify the user of the change - looking for simple shifts, let's say >.1% in any interval.
My initial approach is to run a Beautiful Soup script to obtain posted market data, using either Yahoo Finance or Barron's, as both seem to have the data available in the HTML code:
https://finance.yahoo.com/calendar
http://www.barrons.com/mdc/public/page/9_3000.html?mod=bol_mdc_topnav_9_3000
This is as far as I've gotten and not having much luck, the find function doesn't seem to be returning anything from the site - looking for any nudging that might help me get on the right track with this
from bs4 import BeautifulSoup
from urllib.request import urlopen
import requests
import pandas as pd
url = 'https://finance.yahoo.com/calendar'
page = urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
soup.find("span")
I would expect this to return the first span tag so I could later hone in on the DJIA data: "
span class="Trsdu(0.3s) Fz(s) Mt(4px) Mb(0px) Fw(b) D(ib)" data-reactid="31">26,430.14</span
but the script runs and returns nothing
You can use the same url the bottom of your listed urls is using to source the quote
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://quotes.wsj.com/index/DJIA?mod=mdc_uss_dtabnk')
soup = bs(r.content, 'lxml')
djia = soup.select_one('#quote_val').text
print(djia)
That is clear as source when you inspect the network traffic of the original bottom url you list and then focus on this url
http://www.barrons.com/mdc/public/js/9_3001_Refresh.js?
which has the js for refreshing that value. There you can see the listed source url for quote.
The response which contains:

Beautiful Soup not returning everything in HTML file?

HTML noob here, so I could be misunderstanding something about the HTML document, so bear with me.
I'm using Beautiful Soup to parse web data in Python. Here is my code:
import urllib
import BeautifulSoup
url = "http://www.nba.com/gameline/20160323/"
page = urllib.urlopen(url).read()
soup = BeautifulSoup.BeautifulSoup(page)
indicateGameDone = str(soup.find("div", {"class": "nbaModTopStatus"}))
print indicateGameDone
now, if you look at the website, the HTML code has the line <p class="nbaLiveStatTxSm"> FINAL </p>, (inspect the 'Final' text on the left side of the container on the first ATL-WAS game on the page to see it for youself.) But when I run the code above, my code doesn't return the 'FINAL' that is seen on the webpage, and instead the nbaLiveStatTxSm class is empty.
On my machine, this is the output when I print indicateGameDone:
<div class="nbaModTopStatus"><p class="nbaLiveStatTx">Live</p><p class="nbaLiveStatTxSm"></p><p class="nbaFnlStatTx">Final</p><p class="nbaFnlStatTxSm"></p></div>
Does anyone know why this is happening?
EDIT: clarification: the problem isn't retrieving the text within the tag, the problem is that when I take the html code from the website and print it out in python, something that I saw when I inspected the element on the web is not there in the print statement in Python.
You can use this logic to extract any text.
This code allows you to extract any data between any tags.
Output - FINAL
import urllib
from bs4 import BeautifulSoup
url = "http://www.nba.com/gameline/20160323/"
page = urllib.urlopen(url)
soup = BeautifulSoup(page)
indicateGameDone = soup.find("div", {"class": "nbaFnlStatTx"})
for p in indicateGameDone:
p_text = soup.find("p", {"class": "nbaFnlStatTxSm"})
print(p_text.getText())
break;
It looks like your problem is not with BeautifulSoup but instead with urllib.
Try running the following commands
>>> import urllib
>>> url = "http://www.nba.com/gameline/20160323/"
>>> page = urllib.urlopen(url).read()
>>> page.find('<div class="nbaModTopStatus">')
44230
Which is no surprise considering that Beautiful Soup was able to find the div itself. However when we look a little deeper into what urllib is actually collecting we can see that the <p class="nbaFnlStatTxSm"> is empty by running
>>> page[44230:45000]
'<div class="nbaModTopStatus"><p class="nbaLiveStatTx">Live</p><p class="nbaLiveStatTxSm"></p><p class="nbaFnlStatTx">Final</p><p class="nbaFnlStatTxSm"></p></div><div id="nbaGLBroadcast"><img src="/.element/img/3.0/sect/gameline/broadcasters/lp.png"></div><div class="nbaTeamsRow"><div class="nbaModTopTeamScr nbaModTopTeamAw"><h5 class="nbaModTopTeamName awayteam">ATL</h5><img src="http://i.cdn.turner.com/nba/nba/.element/img/2.0/sect/gameline/teams/ATL.gif" width="34" height="22" title="Atlanta Hawks"><h4 class="nbaModTopTeamNum win"></h4></div><div class="nbaModTopTeamScr nbaModTopTeamHm"><h5 class="nbaModTopTeamName hometeam">WAS</h5><img src="http://i.cdn.turner.com/nba/nba/.element/img/2.0/sect/gameline/teams/WAS.gif" width="34" '
You can see that the tag is empty, so your problem is the data that's being passed to Beautiful Soup, not the package itself.
changed the import of beautifulsoup to the proper syntax for the current version of BeautifulSoup
corrected the way you were constructing the BeautifulSoup object
fixed your find statement, then used the .text command to get the string representation of the text in the HTML you're after.
With some minor modifications to your code as listed above, your code runs for me.
import urllib
from bs4 import BeautifulSoup
url = "http://www.nba.com/gameline/20160323/"
page = urllib.urlopen(url).read()
soup = BeautifulSoup(page)
indicateGameDone = soup.find("div", {"class": "nbaModTopStatus"})
print indicateGameDone.text ## "LiveFinal "
to address comments:
import urllib
from bs4 import BeautifulSoup
url = "http://www.nba.com/gameline/20160323/"
page = urllib.urlopen(url).read()
soup = BeautifulSoup(page)
indicateGameDone = soup.find("p", {"class": "nbaFnlStatTx"})
print indicateGameDone.text

Categories