I'm using Python's bs4 module to parse HTML. However, I've run into a peculiar bug.
When downloading the HTML of pages for parsing, I've noticed BS4 will recognize div objects in some pages but not others, even though the specific object I'm referencing is present in both and the paths are the same.
e.g.
<div class = "item" data-year = "19-20">
<div class = "irrelevant">...</div>
<div class = "irrelevant">...</div>
<div class = "stats-grids">...</div>
<div class = "irrelevant">...</div>
</div>
I've done some digging and often see it suggested that this can be caused by JavaScript on the page generating content that does not appear in the downloaded HTML. However, I don't believe that is the case here, because BS4 correctly resolves the same path on other pages with the code unchanged.
When using...
res = requests.get('examplesite.com')
soup = bs4.BeautifulSoup(res.text, 'html.parser')
element = soup.select('div[data-year = "19-20"] > div[class = "stats-grids"]')
For some pages from the same website, element is correct. For others, it can find div[data-year = "19-20"] and div[class = "stats-grids"] independently of one another, but not when I specify that one is the child of the other.
In other words, both elements are there, but the stats-grids div is not found when I require it to be a direct child of the data-year div.
This may occur due to the site having incorrect HTML (for example a tag is not closed).
Try using html5lib. It will attempt to create a well-formed HTML document by adding additional tags.
Install it using
pip install html5lib
and specify it in the BeautifulSoup constructor
soup = bs4.BeautifulSoup(res.text, 'html5lib')
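A quick way to check whether an unclosed tag is the cause is to compare the child combinator (>) with the descendant combinator (a space). The sketch below uses a made-up snippet, not the asker's page, in which a <p> is never closed. With html.parser the stats-grids div ends up nested inside that <p>, so it is still a descendant of the data-year div but no longer a direct child.

```python
import bs4

# Hypothetical malformed markup: the <p> tag is never closed.
BROKEN_HTML = (
    '<div class="item" data-year="19-20">'
    '<p>'
    '<div class="stats-grids">stats</div>'
    '</div>'
)


def count_matches(html, selector, parser="html.parser"):
    """Return how many elements `selector` matches after parsing with `parser`."""
    soup = bs4.BeautifulSoup(html, parser)
    return len(soup.select(selector))


# Direct-child selector versus any-depth descendant selector.
child = count_matches(BROKEN_HTML, 'div[data-year="19-20"] > div.stats-grids')
descendant = count_matches(BROKEN_HTML, 'div[data-year="19-20"] div.stats-grids')
print(child, descendant)
```

If the descendant selector matches while the child selector does not, the tree was probably built differently from what the browser shows. Passing 'html5lib' as the parser argument should give the browser-like tree, provided html5lib is installed.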
Ref:
Differences between parsers
I am trying to scrape some data off of a FanGraphs webpage as well as interact with the page itself. Since there are many buttons and dropdowns on the page to narrow down my search results, I need to be able to find the corresponding elements in the HTML. However, when I tried to use a 'classic' approach and use modules like requests and urllib.request, the portions of the HTML containing the data I need did not appear.
HTML Snippet
Here is a part of the HTML which contains the elements which I need.
<div id="root-season-grid">
<div class="season-grid-wrapper">
<div class="season-grid-title">Season Stat Grid</div>
<div class="season-grid-controls">
<div class="season-grid-controls-button-row">
<div class="fgButton button-green active isActive">Batting</div>
<div class="fgButton button-green">Pitching</div>
<div class="spacer-v-20"></div>
<div class="fgButton button-green active isActive">Normal</div>
<div class="fgButton button-green">Normal & Changes</div>
<div class="fgButton button-green">Year-to-Year Changes</div>
</div>
</div>
</div>
</div>
</div>
The full CSS path:
html > body > div#wrapper > div#content > div#root-season-grid div.season-grid-wrapper > div.season-grid-controls > div.season-grid-controls-button-row
Attempts
requests and bs4
>>> res = requests.get("https://fangraphs.com/leaders/season-stat-grid")
>>> soup = bs4.BeautifulSoup(res.text, features="lxml")
>>> soup.select("#root-season-grid")
[<div id="root-season-grid"></div>]
>>> soup.select(".season-grid-wrapper")
[]
So bs4 was able to find the <div id="root-season-grid"></div> element, but could not find any descendants of that element.
urllib and lxml
>>> res = urllib.request.urlopen("https://fangraphs.com/leaders/season-stat-grid")
>>> parser = lxml.etree.HTMLParser()
>>> tree = lxml.etree.parse(res, parser)
>>> tree.xpath("//div[@id='root-season-grid']")
[<Element div at 0x131e1b3f8c0>]
>>> tree.xpath("//div[@class='season-grid-wrapper']")
[]
Again, no descendants of the div element could be found, this time with lxml.
I started to wonder if I should be using a different URL address to pass to both requests.get() and urlopen(), so I created a selenium remote browser, browser, then passed browser.current_url to both functions. Unfortunately, the results were identical.
selenium
I did notice however, that using selenium.find_element_by_* and selenium.find_elements_by_* were able to find the elements, so I started using that. However, doing so took a lot of memory and was extremely slow.
selenium and bs4
Since selenium.find_element_by_* worked properly, I came up with a very hacky 'solution'. I selected the full HTML by using the "*" CSS selector then passed that to bs4.BeautifulSoup()
>>> browser = selenium.webdriver.Firefox()
>>> html_elem = browser.find_element_by_css_selector("*")
>>> html = html_elem.get_attribute("innerHTML")
>>> soup = bs4.BeautifulSoup(html, features="lxml")
>>> soup.select("#root-season-grid")
[<div id="root-season-grid"><div class="season-grid-wrapper">...</div></div>]
>>> soup.select(".season-grid-wrapper")
[<div class="season-grid-wrapper">...</div>]
So this last attempt was somewhat of a success, as I was able to get the elements I needed. However, after running a bunch of unit tests and a few integration tests for the module, I realized how inconsistent this is.
Problem
After doing a bunch of research, I concluded the reason why Attempts (1) and (2) didn't work and why Attempt (3) is inconsistent is because the table in the page is rendered by JavaScript, along with the buttons and dropdowns. This also explains why the HTML above is not present when you click View Page Source. It seems that, when requests.get() and urlopen() are called, the JavaScript is not fully rendered, and whether bs4+selenium works depends on how fast the JavaScript renders. Are there any Python libraries which can render the JavaScript before returning the HTML content?
Hopefully this isn't too long of a question. I tried to condense as far as possible without sacrificing clarity.
Just get the page_source from Selenium and pass it to bs4.
browser.get("https://fangraphs.com/leaders/season-stat-grid")
soup = bs4.BeautifulSoup(browser.page_source, features="lxml")
print(soup.select("#root-season-grid"))
However, I'd recommend using their API instead: https://www.fangraphs.com/api/leaders/season-grid/data?position=B&seasonStart=2011&seasonEnd=2019&stat=WAR&pastMinPt=400&curMinPt=0&mode=normal
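As a rough sketch of how that API URL could be used from Python: the helper below rebuilds the query string from the parameters visible in the link, using only the standard library. The function names are made up for illustration, and it is an assumption that the endpoint still exists and returns JSON, so the actual download is left commented out.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_BASE = "https://www.fangraphs.com/api/leaders/season-grid/data"


def build_grid_url(stat="WAR", season_start=2011, season_end=2019):
    """Build the season-grid API URL using the parameters from the link above."""
    params = {
        "position": "B",
        "seasonStart": season_start,
        "seasonEnd": season_end,
        "stat": stat,
        "pastMinPt": 400,
        "curMinPt": 0,
        "mode": "normal",
    }
    return API_BASE + "?" + urlencode(params)


def fetch_grid(url):
    """Download and decode the response (assumes the endpoint returns JSON)."""
    with urlopen(url) as response:
        return json.load(response)


url = build_grid_url()
print(url)
# data = fetch_grid(url)  # needs network access; response shape not verified here
```

Calling the API directly avoids launching a browser at all, which addresses the memory and speed concerns raised in the question.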
Pretty new to Python... and I'm trying my hand at my first project.
I've been able to replicate a few simple demos... but I think there are a few extra complexities with what I'm trying to do.
I'm trying to scrape the game logs from the NHL website.
Here is what I came up with... similar code works for the top section of the site (e.g. getting the age), but it fails on the section with display logic (which depends on whether the user clicks on Career, Game Logs or Splits).
Thanks in advance for your help.
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = 'https://www.nhl.com/player/ryan-getzlaf-8470612?stats=gamelogs-r-nhl&season=20162017'
page = urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
Test = soup.find_all('div', attrs={'id': "gamelogsTable"})
This happens with many web pages. It's because some of the content is downloaded by JavaScript code that is part of the initial download. By doing this, designers are able to show visitors the most important parts of a page without waiting for the entire page to download.
When you want to scrape a page the first thing you should do is to examine the source code for it (often using Ctrl-u in a Windows environment) to see if the content you require is available. If not then you will need to use something beyond BeautifulSoup.
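The same check can be done in code rather than by eye. This is a minimal sketch, with a hypothetical helper name and two made-up HTML strings standing in for the raw page source and the browser-rendered page; it reports whether a container element is present and actually has content.

```python
from bs4 import BeautifulSoup


def is_populated(html, element_id):
    """True if the element with this id exists and contains text or child tags."""
    soup = BeautifulSoup(html, "html.parser")
    element = soup.find(id=element_id)
    if element is None:
        return False
    return bool(element.get_text(strip=True)) or element.find(True) is not None


# Hypothetical stand-ins for the two versions of the page.
raw_source = '<h5>Game Logs</h5><div id="gamelogsTable"></div>'
rendered = '<div id="gamelogsTable"><div class="responsive-datatable">rows</div></div>'

print(is_populated(raw_source, "gamelogsTable"))
print(is_populated(rendered, "gamelogsTable"))
```

If the check fails on the HTML returned by a plain HTTP request, the content is being filled in by JavaScript and a browser-driven tool is needed.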
>>> getzlafURL = 'https://www.nhl.com/player/ryan-getzlaf-8470612?stats=gamelogs-r-nhl&season=20162017'
>>> import requests
>>> import selenium.webdriver as webdriver
>>> import lxml.html as html
>>> import lxml.html.clean as clean
>>> browser = webdriver.Chrome()
>>> browser.get(getzlafURL)
>>> content = browser.page_source
>>> cleaner = clean.Cleaner()
>>> content = cleaner.clean_html(content)
>>> doc = html.fromstring(content)
>>> type(doc)
<class 'lxml.html.HtmlElement'>
>>> open('c:/scratch/temp.htm', 'w').write(content)
775838
By searching within the file temp.htm for the heading 'Ryan Getzlaf Game Logs' I was able to find this section of HTML code. As you can see, it's about what you expected to find in the original downloaded HTML. However, this additional step is required to get at it.
</div>
</li>
</ul>
<h5 class="statistics__subheading">Ryan Getzlaf Game Logs</h5>
<div id="gamelogsTable"><div class="responsive-datatable">
I should mention that there are alternative ways of accessing such code, one of them being dryscrape. I simply can't be bothered installing that one on this Windows machine.
I am currently going through the Web Scraping section of AutomateTheBoringStuff and trying to write a script that extracts translated words from Google Translate using BeautifulSoup4.
I inspected the html content of a page where 'Explanation' is the translated word:
<span id="result_box" class="short_text" lang="en">
<span class>Explanation</span>
</span>
Using BeautifulSoup4, I tried different selectors but nothing would return the translated word. Here are a few examples I tried, but they return no results at all:
soup.select('span[id="result_box"] > span')
soup.select('span span')
I even copied the selector directly from the Developer Tools, which gave me #result_box > span. This again returns no results.
Can someone explain to me how to use BeautifulSoup4 for my purpose? This is my first time using BeautifulSoup4 but I think I am using BeautifulSoup more or less correctly because the selector
soup.select('span[id="result_box"]')
gets me the outer span element**
[<span class="short_text" id="result_box"></span>]
**Not sure why the 'lang="en"' part is missing, but I am fairly certain I have located the correct element regardless.
Here is the complete code:
import bs4, requests
url = 'https://translate.google.ca/#zh-CN/en/%E6%B2%BB%E5%85%B7'
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, "html.parser")
translation = soup.select('#result_box span')
print(translation)
EDIT: If I save the Google Translate page as an offline html file and then make a soup object out of that html file, there would be no problem locating the element.
import bs4
file = open("Google Translate.html")
soup = bs4.BeautifulSoup(file, "html.parser")
translation = soup.select('#result_box span')
print(translation)
The result_box span is the correct element, but your code only works on the saved file because what you save from your browser includes the dynamically generated content. With requests you get only the page source, without any dynamically generated content. The translation is generated by an Ajax call to the URL below:
"https://translate.google.ca/translate_a/single?client=t&sl=zh-CN&tl=en&hl=en&dt=at&dt=bd&dt=ex&dt=ld&dt=md&dt=qca&dt=rw&dt=rm&dt=ss&dt=t&ie=UTF-8&oe=UTF-8&source=bh&ssel=0&tsel=0&kc=1&tk=902911.786207&q=%E6%B2%BB%E5%85%B7"
For your requests it returns:
[[["Fixture","治具",,,0],[,,,"Zhì jù"]],,"zh-CN",,,[["治 具",1,[["Fixture",999,true,false],["Fixtures",0,true,false],["Jig",0,true,false],["Jigs",0,true,false],["Governance",0,true,false]],[[0,2]],"治具",0,1]],1,,[["ja"],,[1],["ja"]]]
So you will either have to mimic the request, passing all the necessary parameters or use something that supports dynamic content like selenium
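Note that the response shown above is not strict JSON, because it contains empty slots (",,"). One possible way to read it, sketched here with a hypothetical helper, is to fill the empty slots with null before handing the text to the json module. The sample is a trimmed excerpt of the response above. This regex approach would also alter a string value that happens to contain ",," or "[,", so treat it as an illustration rather than a robust parser.

```python
import json
import re


def parse_sparse_array(text):
    """Parse a JavaScript-style array literal that has empty slots (",,")."""
    # Insert "null" wherever "[" or "," is immediately followed by ",".
    # Caveat: this would also alter such sequences inside string values.
    filled = re.sub(r"(?<=[\[,])(?=,)", "null", text)
    return json.loads(filled)


# Trimmed excerpt of the response shown above.
sample = '[[["Fixture","治具",,,0],[,,,"Zhì jù"]],,"zh-CN"]'
data = parse_sparse_array(sample)
print(data[0][0][0])  # the translated word
print(data[2])        # the source language code
```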
Simply try this :
translation = soup.select('#result_box span')[0].text
print(translation)
You can try this different approach:
if filename.endswith(extension_file):
    with open(os.path.join(files_from_folder, filename), encoding='utf-8') as html:
        soup = BeautifulSoup('<pre>' + html.read() + '</pre>', 'html.parser')
        for title in soup.findAll('title'):
            recursively_translate(title)
For the complete code, please see here:
https://neculaifantanaru.com/en/python-code-text-google-translate-website-translation-beautifulsoup-library.html
or here:
https://neculaifantanaru.com/en/example-google-translate-api-key-python-code-beautifulsoup.html
Hi, I'm quite new to Python and my boss has asked me to scrape this data. It is not my strong point, so I was wondering how I would go about this.
The text that I'm after (inside the quote marks) also changes every few minutes, so I'm not sure how to locate it.
I am using Beautiful Soup and lxml at the moment; however, if there are better alternatives I'm happy to try them.
This is the inspected element of the webpage:
<div class = "sometext">
<h3> somemoretext </h3>
<p>
<span class = "title" title="text i want">text i want</span>
<br>
</p>
I have tried using:
from lxml import html
import requests
from bs4 import BeautifulSoup
page = requests.get('the url')
soup = BeautifulSoup(page.text)
r = soup.findAll('//span[@class="title"]/text()')
print(r)
Thank you in advance, any help would be appreciated!
First do this to get what you are looking at in the soup:
soup = BeautifulSoup(page.text, 'html.parser')
print(soup)
That way you can double check that you are actually dealing with what you think you are dealing with.
Then do this:
r = soup.findAll('span', attrs={"class":"title"})
for span in r:
    print(span.text)
This will get all the span tags with a class=title, and then text will print out all the text in between the tags.
Edited to Add
Note that esecules' answer will get you the title within the tag (<span class = "title" title="text i want">) whereas mine will get the title from the text (<span class = "title" >text i want</span>)
Perhaps find is the method you really need, since you're only ever looking for one element (see the docs).
r = soup.find('div', 'sometext').find('span','title')['title']
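To make the difference between the two answers concrete, here is a small self-contained sketch run against the snippet from the question (with a closing </div> added so that it parses on its own). It reads the same value once from the text between the tags and once from the title attribute.

```python
from bs4 import BeautifulSoup

# The snippet from the question, with a closing </div> added so it is self-contained.
HTML = """
<div class="sometext">
  <h3> somemoretext </h3>
  <p>
    <span class="title" title="text i want">text i want</span>
    <br>
  </p>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")
span = soup.find("div", "sometext").find("span", "title")

from_text = span.get_text()  # text between the opening and closing tags
from_attr = span["title"]    # value of the title attribute
print(from_text, "|", from_attr)
```

On this page both give the same string, so either works. They would differ on a page where the attribute and the visible text are not kept in sync.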
If you're familiar with XPath and you don't need features that are specific to BeautifulSoup, then using lxml alone is enough (or maybe even better, since lxml is known to be faster):
from lxml import html
import requests
page = requests.get('the url')
root = html.fromstring(page.text)
r = root.xpath('//span[@class="title"]/text()')
print(r)
I'd like to get items from a website with BeautifulSoup.
<div class="post item">
This is the target tag.
Its class attribute contains two class names separated by a space.
First, I wrote,
roots = soup.find_all("div", "post item")
But, it didn't work.
Then I wrote,
html.find_all("div", {'class':['post', 'item']})
I could get items with this, but I am not sure whether it is correct.
Is this code correct?
//// Additional ////
I am sorry,
html.find_all("div", {'class':['post', 'item']})
didn't work properly.
It also extracts elements with class="item".
And I had to write
soup.find_all("div", class_="post item")
with class_= rather than class=. Even so, this doesn't work for me... (>_<)
Target url:
https://flipboard.com/section/%E3%83%8B%E3%83%A5%E3%83%BC%E3%82%B9-3uscfrirj50pdtqb
mycode:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from urllib.request import urlopen
from bs4 import BeautifulSoup
def main():
    target = "https://flipboard.com/section/%E3%83%8B%E3%83%A5%E3%83%BC%E3%82%B9-3uscfrirj50pdtqb"
    html = urlopen(target)
    soup = BeautifulSoup(html, "html.parser")
    roots = soup.find_all("div", class_="post item")
    print(roots)
    for root in roots:
        print("##################")
if __name__ == '__main__':
    main()
You could use a css select:
soup.select("div.post.item")
Or use class_
.find_all("div", class_="post item")
The docs suggest that if you want to search for tags that match two or more CSS classes, you should use a CSS selector, as in the first example.
They give examples of both uses:
You can also search for the exact string value of the class attribute:
css_soup.find_all("p", class_="body strikeout")
# [<p class="body strikeout"></p>]
If you want to search for tags that match two or more CSS classes, you should use a CSS selector:
css_soup.select("p.strikeout.body")
# [<p class="body strikeout"></p>]
The reason your code fails, and the reason any of the above solutions would also fail, is that the class does not exist in the page source. If it were there, they would all work:
In [6]: r = requests.get("https://flipboard.com/section/%E3%83%8B%E3%83%A5%E3%83%BC%E3%82%B9-3uscfrirj50pdtqb")
In [7]: cont = r.text
In [8]: "post item" in cont
Out[8]: False
If you look at the browser source and do a search, you won't find it either. It is generated dynamically and can only be seen if you open a developer console or Firebug. Those elements also contain only some styling and React ids, so I'm not sure what you expect to pull from them even if you did get them.
If you want to get the html that you see in the browser, you will need something like selenium
First of all, note that class is a very special multi-valued attribute and it is a common source of confusion in BeautifulSoup.
html.find_all("div", {'class':['post', 'item']})
This would find all div elements that have either post class or item class (or both, of course). This may produce extra results you don't want to see, assuming you are after div elements with strictly class="post item". If this is the case, you can use a CSS selector:
html.select('div[class="post item"]')
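The three behaviours discussed in this thread can be compared side by side. The sketch below uses made-up markup (three divs labelled A, B and C) rather than the Flipboard page, since that page does not contain the class in its source.

```python
from bs4 import BeautifulSoup

# Hypothetical markup covering the three cases of interest.
HTML = (
    '<div class="post item">A</div>'
    '<div class="item">B</div>'
    '<div class="item post">C</div>'
)
soup = BeautifulSoup(HTML, "html.parser")


def texts(tags):
    """Collect the text of each matched tag, for easy comparison."""
    return [tag.get_text() for tag in tags]


# A list of classes matches tags that have ANY of the listed classes.
either = texts(soup.find_all("div", {"class": ["post", "item"]}))
# A string with a space matches only that exact attribute value.
exact = texts(soup.find_all("div", class_="post item"))
# A CSS selector with chained classes matches tags that have BOTH, in any order.
both = texts(soup.select("div.post.item"))

print(either)  # ['A', 'B', 'C']
print(exact)   # ['A']
print(both)    # ['A', 'C']
```

Which form is right depends on whether the order of the class names matters for your purpose. The chained CSS selector is usually the least surprising choice.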
There is also some more information in a similar thread:
BeautifulSoup returns empty list when searching by compound class names