So I am trying to extract the text in the grand-final section (the winning team's name):
https://i.stack.imgur.com/4QPqI.png
My problem is that the text I'm trying to extract isn't found by soup; it only finds up to (class="sgg2h1cC DEPRECATED_bootstrap_container undefined native-scroll dragscroll"), but as you can see here:
https://i.imgur.com/Brmv6ba.png there is more.
Here is my code. Can someone explain how I would get the info I'm looking for? I'm also pretty new to web scraping.
import requests
from bs4 import BeautifulSoup

URL = 'https://smash.gg/tournament/revolve-oceania-2v2-finale/event/revolve-oceania-2v2-finale-event/brackets/841267/1343704'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

results = soup.find(id="app_feature_canvas")
a = results.find_all('div', class_="regionWrapper-APP_TOURNAMENT_PAGE-FeatureCanvas")

for b in a:
    c = b.find('div', class_="page-section page-section-grey")
    print(c)
What you see in your inspector is not the same as what you get when you use requests. Instead of using the dev console, view the page source.
Those parts of the page are generated by JavaScript and therefore will not appear when you request the page with requests.
URL = 'https://smash.gg/tournament/revolve-oceania-2v2-finale/event/revolve-oceania-2v2-finale-event/brackets/841267/1343704'
page = requests.get(URL)
print(page.text) # notice this is nothing like what you see in the inspector
To get JavaScript execution, consider using Selenium instead of requests.
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get(URL)
html = driver.page_source  # DOM with JavaScript execution complete
soup = BeautifulSoup(html, 'html.parser')
# ... go from here
Alternatively, there may be enough information in the page source to get what you're looking for. Notice there's a lot of JSON in the page source with various info that, presumably, may be used by the JS to populate those elements.
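For example, here is a rough sketch of that idea: pull an embedded JSON blob out of the raw HTML and parse it. The script variable name in the regex is a placeholder - inspect the actual page source to find the real one.
import json
import re
import requests

URL = 'https://smash.gg/tournament/revolve-oceania-2v2-finale/event/revolve-oceania-2v2-finale-event/brackets/841267/1343704'
page = requests.get(URL)

# Look for a <script> assignment of a large JSON object to a JS variable.
# 'window.__SOME_STATE__' is a placeholder name, not the real one.
match = re.search(r'window\.__SOME_STATE__\s*=\s*(\{.*?\});', page.text, re.DOTALL)
if match:
    data = json.loads(match.group(1))
    # ... dig through `data` for the bracket / winner info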
Alternatively still, you can also copy/paste from the DOM browser in your inspector. (right-click the html element and click "copy outer html")
import pyperclip
from bs4 import BeautifulSoup

html = pyperclip.paste()  # put the contents of the clipboard into a variable
soup = BeautifulSoup(html, 'html.parser')

results = soup.find(id="app_feature_canvas")
a = results.find_all('div', class_="regionWrapper-APP_TOURNAMENT_PAGE-FeatureCanvas")

for b in a:
    c = b.find('div', class_="page-section page-section-grey")
    print(c)
And this works :-)
I am trying to scrape some information from web pages. On one page it works fine, but on the other I only get None as the return value.
This code / webpage works fine:
# https://realpython.com/beautiful-soup-web-scraper-python/
import requests
from bs4 import BeautifulSoup
URL = "https://www.monster.at/jobs/suche/?q=Software-Devel&where=Graz"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
name_box = soup.findAll("div", attrs={"class": "company"})
print(name_box)
But with this code / webpage I only get None as the return value:
# https://www.freecodecamp.org/news/how-to-scrape-websites-with-python-and-beautifulsoup-5946935d93fe/
import requests
from bs4 import BeautifulSoup
URL = "https://www.bloomberg.com/quote/SPX:IND"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
name_box = soup.find("h1", attrs={"class": "companyName__99a4824b"})
print(name_box)
Why is that?
(At first I thought that, because of the number in the class name on the second page, "companyName__99a4824b", the class name changes dynamically - but that is not the case: when I refresh the page it is still the same class name...)
The reason you get None is that the Bloomberg page uses JavaScript to load its content while the user is on the page.
BeautifulSoup only sees the HTML that requests received, i.e. the page as it is first served, which does not contain the companyName__99a4824b class.
Only after the browser has run that JavaScript and the page has fully loaded does the HTML include the desired tag.
If you want to scrape that data, you'll need to use something like Selenium, which you can instruct to wait until the desired element of the page is ready.
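A minimal sketch of that approach, using an explicit wait for the class name taken from the question (note the site may still block automated browsers, so treat this as illustrative):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.bloomberg.com/quote/SPX:IND")

# Wait up to 15 seconds until the element is actually present in the DOM.
element = WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CLASS_NAME, "companyName__99a4824b"))
)
print(element.text)
driver.quit()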
The website blocks scrapers, check the title:
print(soup.find("title"))
To bypass this you must use a real browser which can run JavaScript.
A tool called Selenium can do that for you.
I’m trying to scrape all the file paths from links like this: https://github.com/themichaelusa/Trinitum/find/master, without using the GitHub API at all.
The link above contains a data-url attribute in the HTML (table, id='tree-finder-results', class='tree-browser css-truncate'), which is used to make a URL like this: https://github.com/themichaelusa/Trinitum/tree-list/45a2ca7145369bee6c31a54c30fca8d3f0aae6cd
which displays this dictionary:
{"paths":["Examples/advanced_example.py","Examples/basic_example.py","LICENSE","README.md","Trinitum/AsyncManager.py","Trinitum/Constants.py","Trinitum/DatabaseManager.py","Trinitum/Diagnostics.py","Trinitum/Order.py","Trinitum/Pipeline.py","Trinitum/Position.py","Trinitum/RSU.py","Trinitum/Strategy.py","Trinitum/TradingInstance.py","Trinitum/Trinitum.py","Trinitum/Utilities.py","Trinitum/__init__.py","setup.cfg","setup.py"]}
when you view it in a browser like Chrome. However, a GET request to it yields a <Response [400]>.
Here is the code I used:
import requests
from bs4 import BeautifulSoup

username, repo = 'themichaelusa', 'Trinitum'
ghURL = 'https://github.com'
url = ghURL + '/{}/{}/find/master'.format(username, repo)

html = requests.get(url)
soup = BeautifulSoup(html.text, "lxml")
repoContent = soup.find('div', class_='tree-finder clearfix')
fileLinksURL = ghURL + str(repoContent.find('table').attrs['data-url'])
filePaths = requests.get(fileLinksURL)
print(filePaths)
Not sure what is wrong with it. My theory is that the first link creates a cookie that allows the second link to show the file paths of the repo we are targeting. I'm just unsure how to achieve this via code. Would really appreciate some pointers!
Give this a go. The links containing the .py files are generated dynamically, so to catch them you need to use Selenium. I think this is what you expected.
from bs4 import BeautifulSoup
from selenium import webdriver
from urllib.parse import urljoin

url = 'https://github.com/themichaelusa/Trinitum/find/master'

driver = webdriver.Chrome()
driver.get(url)
soup = BeautifulSoup(driver.page_source, "lxml")
driver.quit()

for link in soup.select('#tree-finder-results .js-tree-finder-path'):
    print(urljoin(url, link['href']))
Partial results:
https://github.com/themichaelusa/Trinitum/blob/master
https://github.com/themichaelusa/Trinitum/blob/master/Examples/advanced_example.py
https://github.com/themichaelusa/Trinitum/blob/master/Examples/basic_example.py
https://github.com/themichaelusa/Trinitum/blob/master/LICENSE
https://github.com/themichaelusa/Trinitum/blob/master/README.md
https://github.com/themichaelusa/Trinitum/blob/master/Trinitum/AsyncManager.py
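As an aside, the cookie theory from the question can be tested without a browser. Here is a hedged sketch that reuses the cookies from the first request and sends XHR-style headers when hitting the data-url endpoint; whether GitHub still accepts plain requests like this is an assumption, so verify it yourself:
import requests
from bs4 import BeautifulSoup

ghURL = 'https://github.com'

with requests.Session() as session:  # the session keeps any cookies set by the first request
    soup = BeautifulSoup(session.get(ghURL + '/themichaelusa/Trinitum/find/master').text, 'lxml')
    table = soup.find('table', id='tree-finder-results')
    if table is not None:
        fileLinksURL = ghURL + table['data-url']
        # Ask for JSON the way the in-page JavaScript would.
        resp = session.get(fileLinksURL, headers={'Accept': 'application/json',
                                                  'X-Requested-With': 'XMLHttpRequest'})
        print(resp.json().get('paths'))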
I am trying to scrape a website, but I ran into a problem: the HTML I get in Python differs from what I see in Google Chrome's inspector. The page is http://edition.cnn.com/election/results/states/arizona/house/01 and I am trying to scrape the election results. I used the script below to check the HTML of the page, and I noticed that the two differ - the classes I need, like section-wrapper, are not there.
import requests
from bs4 import BeautifulSoup

page = requests.get('http://edition.cnn.com/election/results/states/arizona/house/01')
soup = BeautifulSoup(page.content, "lxml")
print(soup)
Does anyone know what the problem is?
http://data.cnn.com/ELECTION/2016/AZ/county/H_d1_county.json
This site uses JavaScript to fetch its data; you can check the URL above.
You can find this URL in Chrome dev tools - there are many requests there, check them out:
Chrome >> F12 >> Network tab >> F5 (refresh page) >> double-click the .json URL >> it opens in a new tab
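A quick sketch of that idea - fetch the JSON endpoint directly instead of the HTML page (the structure of the returned JSON is not documented here, so inspect it first):
import requests

json_url = 'http://data.cnn.com/ELECTION/2016/AZ/county/H_d1_county.json'
data = requests.get(json_url).json()

# Print the top-level structure to see how the results are organised, then drill down.
print(data.keys() if isinstance(data, dict) else type(data))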
import requests
from bs4 import BeautifulSoup

page = requests.get('http://edition.cnn.com/election/results/states/arizona/house/01')
soup = BeautifulSoup(page.content, "html.parser")

# you can try all sorts of tags here; I used class "ad" and class "ec-placeholder"
g_data = soup.find_all("div", {"class": "ec-placeholder"})
h_data = soup.find_all("div", {"class": "ad"})

for item in g_data:
    print(item)
# print('\n')
# for item in h_data:
#     print(item)
[What I'm trying to do]
Scrape the webpage below for used car data.
http://www.goo-net.com/php/search/summary.php?price_range=&pref_c=08,09,10,11,12,13,14&easysearch_flg=1
[Issue]
I want to scrape all of the pages. At the URL above, only the first 30 items are shown; those can be scraped by the code below, which I wrote. Links to the other pages are displayed like 1 2 3..., but the link addresses seem to be in JavaScript. I googled for useful information but couldn't find any.
from bs4 import BeautifulSoup
import urllib.request

html = urllib.request.urlopen("http://www.goo-net.com/php/search/summary.php?price_range=&pref_c=08,09,10,11,12,13,14&easysearch_flg=1")
soup = BeautifulSoup(html, "lxml")

total_cars = soup.find(class_="change change_01").find('em').string
tmp = soup.find(class_="change change_01").find_all('span')
car_start, car_end = tmp[0].string, tmp[1].string

# get urls to car detail pages
car_urls = []
heading_inners = soup.find_all(class_="heading_inner")
for heading_inner in heading_inners:
    href = heading_inner.find('h4').find('a').get('href')
    car_urls.append('http://www.goo-net.com' + href)

for url in car_urls:
    html = urllib.request.urlopen(url)
    soup = BeautifulSoup(html, "lxml")
    # title
    print(soup.find(class_='hdBlockTop').find('p', class_='tit').string)
    # price of car itself
    print(soup.find(class_='price1').string)
    # price of car including tax
    print(soup.find(class_='price2').string)
    tds = soup.find(class_='subData').find_all('td')
    # year
    print(tds[0].string)
    # distance
    print(tds[1].string)
    # displacement
    print(tds[2].string)
    # inspection
    print(tds[3].string)
[What I'd like to know]
How to scrape all of the pages. I prefer to use BeautifulSoup4 (Python), but if that is not the appropriate tool, please point me to other ones.
[My environment]
Windows 8.1
Python 3.5
PyDev (Eclipse)
BeautifulSoup4
Any guidance would be appreciated. Thank you.
You can use Selenium, like the sample below:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('http://example.com')
element = driver.find_element_by_class_name("yourClassName") #or find by text or etc
element.click()
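Building on that, one possible pattern for the paging (the "next" class name is a placeholder - check the actual markup of the pager) is to parse each page with BeautifulSoup, click the next-page link, and repeat until it disappears:
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Firefox()
driver.get("http://www.goo-net.com/php/search/summary.php?price_range=&pref_c=08,09,10,11,12,13,14&easysearch_flg=1")

while True:
    soup = BeautifulSoup(driver.page_source, "lxml")
    for heading_inner in soup.find_all(class_="heading_inner"):
        href = heading_inner.find('h4').find('a').get('href')
        print('http://www.goo-net.com' + href)
    try:
        driver.find_element_by_class_name("next").click()  # placeholder class name
        time.sleep(3)  # crude wait for the next page to load
    except NoSuchElementException:
        break

driver.quit()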
The Python module splinter may be a good starting point. It drives an external browser (such as Firefox) and accesses the browser's DOM rather than dealing with the HTML only.
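A minimal sketch with splinter, assuming Firefox is installed (the selector is taken from the question's code):
from bs4 import BeautifulSoup
from splinter import Browser

with Browser('firefox') as browser:
    browser.visit("http://www.goo-net.com/php/search/summary.php?price_range=&pref_c=08,09,10,11,12,13,14&easysearch_flg=1")
    # browser.html is the DOM after JavaScript has run, so it can be fed to BeautifulSoup
    soup = BeautifulSoup(browser.html, "lxml")
    print(len(soup.find_all(class_="heading_inner")))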
I was trying to do some web scraping and was using the following code:
import mechanize
from bs4 import BeautifulSoup

url = "http://www.thehindu.com/archive/web/2010/06/19/"
br = mechanize.Browser()
htmltext = br.open(url).read()

link_dictionary = {}
soup = BeautifulSoup(htmltext)

for tag_li in soup.findAll('li', attrs={"data-section":"Chennai"}):
    for link in tag_li.findAll('a'):
        link_dictionary[link.string] = link.get('href')
        print(link_dictionary[link.string])
        urlnew = link_dictionary[link.string]
        brnew = mechanize.Browser()
        htmltextnew = brnew.open(urlnew).read()
        articletext = ""
        soupnew = BeautifulSoup(htmltextnew)
        for tag in soupnew.findAll('p'):
            articletext += tag.text
        print(articletext)
I was unable to get any printed values by using this. But on using attrs={"data-section":"Business"} instead of attrs={"data-section":"Chennai"} I was able to get the desired output. Can someone help me?
READ THE TERMS OF SERVICE OF THE WEBSITE BEFORE SCRAPING
If you are using Firebug or Inspect Element in Chrome, you might see some content that will not be there if you fetch the page with Mechanize or urllib2.
For example, view the source code of the page as it is actually sent to you (right-click > View Page Source in Chrome) and search for the data-section attribute: you won't see any tags with "Chennai". I am not 100% sure, but I would say that content is populated by JavaScript etc., which requires the functionality of a browser.
If I were you, I would use Selenium to open up the page and then get the page source from there; the HTML collected that way will be much closer to what you see in a browser.
Cited here
from selenium import webdriver
from bs4 import BeautifulSoup
import time

driver = webdriver.Firefox()
driver.get("URL GOES HERE")

# I noticed there is an ad here; sleep until the page is fully loaded.
time.sleep(10)

soup = BeautifulSoup(driver.page_source)
print(len(soup.findAll(...)))

# or you can work directly in selenium
...
driver.close()
And the output for me is 8