BeautifulSoup Parses Table Incorrectly - python

Having trouble getting Beautiful Soup to process a large table of play-by-play basketball data properly. Code:
import urllib.request
from bs4 import BeautifulSoup
request = urllib.request.Request('http://www.basketball-reference.com/boxscores/pbp/201611220LAL.html')
result = urllib.request.urlopen(request)
resulttext = result.read()
soup = BeautifulSoup(resulttext, "html.parser")
pbpTable = soup.find('table', id="pbp")
If you run this example yourself, you will find that the table is not fully parsed; all we get out is this:
<table class="suppress_all sortable stats_table" data-cols-to-freeze="1" id="pbp">
<caption>Play-By-Play Table</caption>
<tr class="thead" id="q1">
<th colspan="6">1st Q</th></tr></table>
The problem is in the parsing itself. Printing the soup variable gives (among other things):
</div>
<div class="table_wrapper" id="all_pbp">
<div class="section_heading">
<span class="section_anchor" data-label="Play-By-Play" id="pbp_link"></span>
<h2>Play-By-Play</h2> <div class="section_heading_text">
<ul> <li>  Jump to: 1st | 2nd | 3rd | 4th <br> <span class="bbr-play-score key">scoring play</span> <span class="bbr-play-tie key">tie</span> <span class="bbr-play-leadchange key">lead change</span></br></li>
</ul>
</div>
</div> <div class="table_outer_container">
<div class="overthrow table_container" id="div_pbp">
<table class="suppress_all sortable stats_table" data-cols-to-freeze="1" id="pbp"><caption>Play-By-Play Table</caption><tr class="thead" id="q1">
<th colspan="6">1st Q</th></tr></table></div></div></div></div></div></body></html>
Most importantly, a </table> tag appears out of nowhere. Viewing the page source of the relevant link, we can see that the table is not closed there; it goes on for a while. Is there any fix for this besides implementing my own HTML parsing code?

Use "lxml" or "html5lib" instead of "html.parser":
soup = BeautifulSoup(resulttext, "lxml")
and you get more data.
You may have to install lxml or html5lib first if you don't have them yet:
pip install lxml
pip install html5lib
lxml may need a C/C++ compiler and the libxml2 library (a DLL on Windows) when no prebuilt package is available for your platform.
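If it is unclear which parsers are installed on a given machine, one option is to try them in order of preference. The sketch below assumes a hypothetical helper named `make_soup` (it is not part of Beautiful Soup) and relies on the `FeatureNotFound` exception that `bs4` raises when a requested parser is unavailable. The sample markup is a small offline stand-in for the downloaded page.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, parsers=("lxml", "html5lib", "html.parser")):
    """Return (soup, parser_name) using the first parser that is installed.

    make_soup is a hypothetical helper, not part of Beautiful Soup.
    """
    for name in parsers:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is available")


# Small offline sample standing in for the downloaded page.
sample = '<table id="pbp"><tr><th>1st Q</th></tr><tr><td>12:00.0</td></tr></table>'
soup, used = make_soup(sample)
rows = soup.find("table", id="pbp").find_all("tr")
print(used, len(rows))
```

Since "html.parser" ships with Python, the fallback chain should always end with a usable parser, though the parsers can build different trees from the same broken markup.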

Related

Parsing website with BeautifulSoup and Requests returns None

I'm a complete beginner in programming and am working on a project of my own. For that I'm trying to parse data from a website to make a tool that uses the data. I found that BeautifulSoup and Requests are common tools for this, but unfortunately I cannot seem to make it work. It always returns None or an error that says:
"TypeError: 'NoneType' object is not callable"
Did I do anything wrong? Is it maybe not possible to parse some websites' data, or am I being restricted from accessing it?
If there are other ways to access the data, I'm happy to hear them as well.
Here is my code:
from bs4 import BeautifulSoup
import requests
pickrates = {} # dict to store winrate of champions for each position
source = requests.get("http://u.gg/lol/champions/aatrox/build?role=top").text
soup = BeautifulSoup(source, "lxml")
value = soup.find("div", class_="content-section champion-ranking-stats")
print(value.prettify())
Remember that when you request a web page with the requests module, you only get the HTML that the server returns for that page. The module does not execute JavaScript.
Try this code:
import requests
source = requests.get("http://u.gg/lol/champions/aatrox/build?role=top").text
print(source)
Then search by hand (Ctrl+F) for the class names you provided; there are no such elements at all. That means they are generated by other requests, such as AJAX calls, after the initial HTML page is loaded. So they are not even in the .text attribute of the response object, before Beautiful Soup comes to the party.
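The manual Ctrl+F check can also be scripted. The following is a minimal sketch under stated assumptions: `missing_from_static_html` is a hypothetical helper, and `static_html` is an invented response body for a JavaScript-driven page, not the real response from the site.

```python
def missing_from_static_html(html, class_names):
    """Return the class names that never appear in the raw HTML text.

    A plain substring check is crude, but it mirrors the manual Ctrl+F test.
    """
    return [name for name in class_names if name not in html]


# Hypothetical response body of a JavaScript-driven page: only an empty
# mount point and a script tag are present in the initial HTML.
static_html = (
    '<html><body><div id="root"></div>'
    '<script src="app.js"></script></body></html>'
)

missing = missing_from_static_html(static_html, ["champion-ranking-stats", "root"])
print(missing)  # ['champion-ranking-stats']
```

If a class name you see in the browser's inspector is reported as missing here, the element is most likely created after the initial page load.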
One way of handling this is to use Selenium or any other library that can execute the JavaScript.
As in this question (can't find html tag when I scrape web using beautifulsoup), the problem is probably caused by content that JavaScript adds after the page loads. I would suggest using Selenium to handle this: let Selenium send the request and return the rendered page source, then parse that with BeautifulSoup.
Don't forget to download a browser driver from https://www.selenium.dev/documentation/getting_started/installing_browser_drivers/ and place it in the same directory as your code.
The example below uses Selenium with Firefox:
import time
from selenium import webdriver
from bs4 import BeautifulSoup
URL = 'http://u.gg/lol/champions/aatrox/build?role=top'
browser = webdriver.Firefox()
browser.get(URL)
# give the page's JavaScript a moment to render before reading the source
time.sleep(1)
soup = BeautifulSoup(browser.page_source, 'html.parser')
browser.quit()  # close the browser and end the driver session
value = soup.find("div", class_="content-section champion-ranking-stats")
print(value.prettify())
Your expected output would be like:
>>> print(value.prettify())
<div class="content-section champion-ranking-stats">
<div class="win-rate meh-tier">
<div class="value">
48.4%
</div>
<div class="label">
Win Rate
</div>
</div>
<div class="overall-rank">
<div class="value">
49 / 58
</div>
<div class="label">
Rank
</div>
</div>
<div class="pick-rate">
<div class="value">
3.6%
</div>
<div class="label">
Pick Rate
</div>
</div>
<div class="ban-rate">
<div class="value">
2.3%
</div>
<div class="label">
Ban Rate
</div>
</div>
<div class="matches">
<div class="value">
55,432
</div>
<div class="label">
Matches
</div>
</div>
</div>
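A fixed `time.sleep` can be too short on a slow connection and needlessly long on a fast one. Selenium's explicit waits block until a condition holds instead. The sketch below is one possible way to do that, not the answerer's method; `class_selector` and `fetch_rendered_html` are hypothetical helper names, the 10-second timeout is an arbitrary assumption, and running the fetch requires Firefox plus its driver.

```python
def class_selector(tag, class_attr):
    """Turn a tag name and a space-separated class attribute into a CSS selector."""
    return tag + "".join("." + name for name in class_attr.split())


def fetch_rendered_html(url, css_selector, timeout=10):
    """Load url in Firefox and return the page source once css_selector is present.

    Selenium is imported here so this module can be loaded without it.
    """
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    browser = webdriver.Firefox()
    try:
        browser.get(url)
        # Block until the element exists, or raise TimeoutException.
        WebDriverWait(browser, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return browser.page_source
    finally:
        # Always shut the browser down, even if the wait times out.
        browser.quit()


selector = class_selector("div", "content-section champion-ranking-stats")
print(selector)  # div.content-section.champion-ranking-stats
```

The returned string can then be passed to `BeautifulSoup(html, "html.parser")` exactly as `browser.page_source` is in the answer's code.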

Python: scrape a part of source code and save it as html

Here is the case: I need to save a web page's source code as an HTML file. But the web page has lots of sections that I don't need; I only want to save the source code of the article itself.
code:
from urllib.request import urlopen
page = urlopen('http://www.abcde.com')
page_content = page.read()
with open('page_content.html', 'wb') as f:
    f.write(page_content)
I can save the whole source code with my code, but how can I save only the part I want?
Explain:
<div itemscope itemtype="http://schema.org/MedicalWebPage">
.
.
.
</div>
I need to save the source code of this tag and everything inside it, not extract the sentences in the tags.
The result I want is to save like this:
<div itemscope itemtype="http://schema.org/MedicalWebPage">
<div class="col-md-12 col-xs-12" style="padding-left:10px;">
<h1 itemprop="name" class="page_article_title" title="Apple" id="mask">Apple</h1>
</div>
<!--Article Start-->
<section class="page_article_div" id="print">
<article itemprop="text" class="page_article_content">
<p>
<img alt="Apple" src="http://www.abcde.com/383741719.jpg" style="width: 300px; height: 200px;" /></p>
<p>
The apple tree (Malus pumila, commonly and erroneously called Malus domestica) is a deciduous tree in the rose family best known for its sweet, pomaceous fruit, the apple.</p>
<p>
It is cultivated worldwide as a fruit tree, and is the most widely grown species in the genus Malus.</p>
<p>
<strong><span style="color: #884499;">Appe is red</span></strong></p>
<ol>
<li>
Germanic paganism</li>
<li>
Greek mythology</li>
</ol>
<p style="text-align: right;">
【Jane】</p>
<p style="text-align: right;">
Credit : Wiki</p>
</article>
<div style="text-align:right;font-size:1.2em;"><a class="authorlink" href="http://www.abcde.com/web/online;url=http://61.66.117.1234/name=2017">2017</a></div>
<br />
<div style="text-align:right;font-size:1.2em;">【Thank you!】</div>
</section>
<!--Article End-->
</div>
My own solution here:
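from urllib.request import urlopen
from bs4 import BeautifulSoup
page = urlopen('http://www.abcde.com')
page_content = page.read()
soup = BeautifulSoup(page_content, "lxml")
# collect the HTML of every matching div, tags included
fragments = []
for tag in soup.select('div[itemtype="http://schema.org/MedicalWebPage"]'):
    fragments.append(str(tag))
# join with newlines so no stray ", " text ends up between fragments
html_out = '\n'.join(fragments)
#print(html_out)
#print(type(html_out))
with open('C:/html/try.html', 'w', encoding='UTF-8') as f:
    f.write(html_out)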
I am a beginner, so I am trying to keep it as simple as possible. This is my answer, and it's working quite well at the moment :)
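When only one matching block is expected, `select_one` avoids the list handling. The following is a minimal offline sketch: `page_content` is an invented stand-in for `page.read()`, and the output file name `article_only.html` in the system temp directory is an assumption chosen for illustration.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Offline stand-in for page.read(); the real page content will differ.
page_content = """
<html><body>
<div class="sidebar">not wanted</div>
<div itemscope itemtype="http://schema.org/MedicalWebPage">
<h1 itemprop="name">Apple</h1>
<p>Article text</p>
</div>
</body></html>
"""

soup = BeautifulSoup(page_content, "html.parser")
article = soup.select_one('div[itemtype="http://schema.org/MedicalWebPage"]')

# str(tag) serialises the tag together with everything nested inside it.
article_html = str(article) if article is not None else ""

# Hypothetical output location, used here so the sketch runs anywhere.
out_path = os.path.join(tempfile.gettempdir(), "article_only.html")
with open(out_path, "w", encoding="utf-8") as f:
    f.write(article_html)
```

`select_one` returns `None` when nothing matches, which is why the sketch guards before calling `str`.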
You can search for the tag by a property such as its class, tag name, or id, and save it in whatever format you want, as in the example below.
soup = BeautifulSoup(yoursavedfile.read(), 'html.parser')
# find_all(class_=...) returns every tag carrying that class
tag_for_me = soup.find_all(class_='class_name_of_your_tag')
print(tag_for_me)
tag_for_me will hold the code you need.
You can use Beautiful Soup to get any HTML source you need.
import requests
from bs4 import BeautifulSoup
target_class = "gb4"
target_text = "Web History"
r = requests.get("https://google.com")
soup = BeautifulSoup(r.text, "lxml")
for elem in soup.find_all(attrs={"class":target_class}):
    if elem.text == target_text:
        print(elem)
Output:
<a class="gb4" href="http://www.google.com/history/optout?hl=en">Web History</a>
Use BeautifulSoup to find the place where you want to insert, fetch the HTML you want to insert, and use insert() to add it as a new tag. Then overwrite the original file.
from bs4 import BeautifulSoup
import requests
# Use Beautiful Soup to get the place you want to insert into.
# div_tag is the extracted div
soup = BeautifulSoup("Your content here", 'lxml')
div_tag = soup.find('div', attrs={'itemscope': True})
# e.g.
# div_tag = <div itemscope itemtype="http://schema.org/MedicalWebPage"></div>
res = requests.get('url to get content from')
soup1 = BeautifulSoup(res.text, 'lxml')
insert_data = soup1.find('your div/data to insert')
# This inserts the tag into div_tag at position 3.
div_tag.insert(3, insert_data)
# div_tag now contains your desired output. Write it back over the original page_content.html.

BeautifulSoup scraping returns {{}} that does not have data

I am trying to parse a site that looks like below:
<div class="address">
<div class="hit-company">Amy Gold</div>
<div class="speciality hit-speciality">Audiology</div>
<div class="address hit-address"><i><p translate="no">
<span class="address-line1">38 Park Drive </span><br>
<span class="locality">London</span>, <span class="administrative-area">VA</span> <span class="postal-code">22025</span><br>
</p></i></div>
<div class="phone hit-phone"><i>(xxx) 659-xxx</i></div>
<div class="description hit-listing_description hidden-xs"></div>
<div class="hit-website">Visit Website</div>
</div>
I used Beautiful Soup to scrape this:
import os
from urllib.request import Request, urlretrieve, urlopen
from bs4 import BeautifulSoup
req = Request("https://www.urlxxxxxx.com", headers={'User-Agent': 'Mozilla/5.0'})
page1 = urlopen(req)
phtml = BeautifulSoup(page1, 'html5lib')
print(phtml)
divs = phtml.find_all("div", attrs={"class":"hit-company"})
print('aaaaa-----' + str(divs))
I tried html5lib, lxml, and html.parser. lxml and html.parser do not even pick up the div with class "hit-company"; only html5lib does. Even with html5lib, divs comes out empty.
When I examine the html output I notice
<div class="hit-company">{{person}}</div>
<div class="speciality hit-speciality">{{specialty}}</div>
<span class="address-line1">{{address}}</span><br>
The actual data is replaced by placeholders such as {{parameter x}}. Can you please help solve this?
Thanks
Based on the comment by @crossal:
The site being scraped (an internal page) had dynamically generated content, and the problem was solved by using Selenium and PhantomJS.
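Unrendered `{{ ... }}` markers are a useful signal that the HTML you received is a client-side template rather than the finished page. A minimal sketch for detecting them follows; `find_placeholders` is a hypothetical helper and `raw` is an invented sample modelled on the output quoted in the question.

```python
import re

# Matches {{name}} or {{ name }} and captures the text between the braces.
PLACEHOLDER = re.compile(r"\{\{\s*([^{}]+?)\s*\}\}")


def find_placeholders(html):
    """Return the names of unrendered {{ ... }} template placeholders."""
    return PLACEHOLDER.findall(html)


raw = (
    '<div class="hit-company">{{person}}</div>'
    '<div class="speciality hit-speciality">{{ specialty }}</div>'
)
found = find_placeholders(raw)
print(found)  # ['person', 'specialty']
```

If this returns a non-empty list for a fetched page, switching parsers is unlikely to help; the page needs to be rendered in a browser, or the underlying data request needs to be located.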

Scraping the links from a specific url

This is my first question, so please forgive me if I have explained anything wrong.
I am trying to scrape URLs from a specific website in Python and write the links to a CSV. The problem is that when I parse the website with BeautifulSoup I can't extract the URLs, because all I get is <div id="dvScores" style="min-height: 400px;">\n</div>, with nothing under that branch. But when I open the browser console, copy the table where the links are, and paste it into a text editor, I get 600 pages of HTML. What I want to do is write a for loop that shows the links. The structure of the HTML is below:
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
#shadow-root (open)
<head>...</head>
<body>
<div id="body">
<div id="wrapper">
#multiple divs but i don't need them
<div id="live-master"> #what I need is under this div
<span id="contextual">
#multiple divs but i don't need them
<div id="live-score-master"> #what I need is under this div
<div ng-app="live-menu" id="live-score-rightcoll">
#multiple divs but i don't need them
<div id="left-score-lefttemp" style="padding-top: 35px;">
<div id="dvScores">
<table cellspacing=0 ...>
<colgroup>...</colgroup>
<tbody>
<tr class="row line-bg1"> #this changes to bg2 or bg3
<td class="row">
<span class="row">
<a href="www.example.com" target="_blank" class="td_row">
#I need to extract this link
</span>
</td>
#Multiple td's
</tr>
#multiple tr class="row line-bg1" or "row line-bg2"
.
.
.
</tbody>
</table>
</div>
</div>
</div>
</div>
</span>
</div>
</div>
</body>
</html>
What am I doing wrong? I need an automated way to do this in Python, rather than pasting the HTML into a text file and extracting the links with a regex.
My Python code is below:
import requests
from bs4 import BeautifulSoup
r=requests.get("http://example.com/example")
c=r.content
soup=BeautifulSoup(c,"html.parser")
all=soup.find_all("span",id="contextual")
span=all[0].find_all("tbody")
If you are trying to scrape URLs, then you should get the hrefs:
urls = soup.find_all('a', href=True)
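Note that `find_all('a', href=True)` returns tag objects, not strings, and it only helps when the links are present in the HTML the server returned. A minimal sketch of pulling out absolute URLs follows; the sample markup and `base_url` are assumptions loosely based on the structure shown in the question.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Invented sample loosely following the table structure in the question.
html = """
<table><tbody>
<tr class="row line-bg1"><td><a href="/match/1" class="td_row">A</a></td></tr>
<tr class="row line-bg2"><td><a href="http://www.example.com/match/2">B</a></td></tr>
<tr class="row line-bg1"><td><a name="no-href">C</a></td></tr>
</tbody></table>
"""
base_url = "http://www.example.com/"  # assumed base URL for relative links

soup = BeautifulSoup(html, "html.parser")
# href=True keeps only anchors that actually carry an href attribute.
links = [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]
print(links)
```

`urljoin` leaves absolute links untouched and resolves relative ones against the base, so the result can be written straight to a CSV.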
This site uses JavaScript to populate its content, so you can't get the URLs via BeautifulSoup. If you inspect the network tab in your browser, you can spot the link requested in the code below. It contains all the data you need. You can simply parse it and extract the desired values.
import requests
req = requests.get('http://goapi.mackolik.com/livedata?group=0').json()
for el in req['m'][4:100]:
    index = el[0]
    team_1 = el[2].replace(' ', '-')
    team_2 = el[4].replace(' ', '-')
    print('http://www.mackolik.com/Mac/{}/{}-{}'.format(index, team_1, team_2))
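Since the question asks for the links in a CSV, the same URL construction can be wrapped in a function and fed to the `csv` module. This is a sketch under stated assumptions: `payload` is an invented sample that only mimics the positions the answer relies on (match id at index 0, team names at indexes 2 and 4), and `build_match_urls` is a hypothetical helper. The real response may carry more fields or change shape.

```python
import csv
import io

# Hypothetical payload mimicking the shape the answer relies on: a list under
# "m" whose items hold the match id at index 0 and team names at 2 and 4.
payload = {
    "m": [
        [101, 0, "Team One", 0, "Team Two"],
        [102, 0, "Alpha", 0, "Beta FC"],
    ]
}


def build_match_urls(data, base="http://www.mackolik.com/Mac"):
    """Build one match URL per item in data["m"]."""
    urls = []
    for item in data.get("m", []):
        if len(item) < 5:
            continue  # skip rows that do not have the expected fields
        match_id, team_1, team_2 = item[0], item[2], item[4]
        slug = "{}-{}".format(team_1, team_2).replace(" ", "-")
        urls.append("{}/{}/{}".format(base, match_id, slug))
    return urls


urls = build_match_urls(payload)

# Write to an in-memory buffer here; swap in open("links.csv", "w", newline="")
# to write a real file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["url"])
writer.writerows([u] for u in urls)
print(buffer.getvalue())
```

With live data, `payload` would be replaced by the parsed JSON from the request, for example the result of calling `.json()` on the response.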
It seems the HTML is being generated dynamically by JavaScript. You would need to crawl it with something that mimics a browser. Since you are already using requests, you can at least reuse one session across requests; note that a session persists cookies and connection settings but does not execute JavaScript.
session = requests.session()
data = session.get("http://website.com").content  # usage example
After this you can do the parsing, additional scraping, etc.

Extract Text from HTML Python (BeautifulSoup, RE, Other Option?)

I am familiar with BeautifulSoup and Regular Expressions as a means of extracting text from HTML but not as familiar with others, such as ElementTree, Minidom, etc.
My question is fairly straightforward. Given the HTML snippet below, which library is best for extracting the text below? The text being the integer.
<td class="tl-cell tl-popularity" data-tooltip="7,944,796" data-tooltip-instant="">
<div class="pop-meter">
<div class="pop-meter-background"></div>
<div class="pop-meter-overlay" style="width: 55%"></div>
</div>
</td>
With BeautifulSoup it is fairly straight-forward:
from bs4 import BeautifulSoup
data = """
<td class="tl-cell tl-popularity" data-tooltip="7,944,796" data-tooltip-instant="">
<div class="pop-meter">
<div class="pop-meter-background"></div>
<div class="pop-meter-overlay" style="width: 55%"></div>
</div>
</td>
"""
soup = BeautifulSoup(data, "html.parser")
print(soup.td['data-tooltip'])
If you have multiple td elements and you need to extract the data-tooltip from each one:
for td in soup.find_all('td', {'data-tooltip': True}):
    print(td['data-tooltip'])
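The attribute value is a string with thousands separators, so if the integer itself is needed it has to be converted. A minimal sketch follows, with an invented three-cell sample; the second tooltip value is made up for illustration.

```python
from bs4 import BeautifulSoup

# Invented sample: two cells with a tooltip and one without.
html = """
<table><tr>
<td class="tl-cell tl-popularity" data-tooltip="7,944,796"></td>
<td class="tl-cell tl-popularity" data-tooltip="12,345"></td>
<td class="tl-cell">no tooltip</td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
# Strip the thousands separators before converting to int.
values = [
    int(td["data-tooltip"].replace(",", ""))
    for td in soup.find_all("td", attrs={"data-tooltip": True})
]
print(values)  # [7944796, 12345]
```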
