i try to scrape some informations from a webpage and on the one page it is working fine, but on the other webpage it is not working cause i only get a none return-value
This code / webpage is working fine:
# https://realpython.com/beautiful-soup-web-scraper-python/
import requests
from bs4 import BeautifulSoup
URL = "https://www.monster.at/jobs/suche/?q=Software-Devel&where=Graz"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
name_box = soup.findAll("div", attrs={"class": "company"})
print (name_box)
But with this code / webpage i only get a None as return-value
# https://www.freecodecamp.org/news/how-to-scrape-websites-with-python-and-beautifulsoup-5946935d93fe/
import requests
from bs4 import BeautifulSoup
URL = "https://www.bloomberg.com/quote/SPX:IND"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
name_box = soup.find("h1", attrs={"class": "companyName__99a4824b"})
print (name_box)
Why is that?
(at first i thought due the number in the class on the second webpage "companyName__99a4824b" it changes the classname dynamicly - but this is not the case - when i refresh the webpage it is still the same classname...)
The reason you get None is that the Bloomberg page uses Javascript to load its content while the user is on the page.
BeautifulSoup simply returns to you the html of the page as found as soon as it reaches the page -- which does not contain the companyName_99a4824b class-tag.
Only after the user has waited for the page to fully load does the html include the desired tag.
If you want to scrape that data, you'll need to use something like Selenium, which you can instruct to wait until the desired element of the page is ready.
The website blocks scrapers, check the title:
print(soup.find("title"))
To bypass this you must use a real browser which can run JavaScript.
A tool called Selenium can do that for you.
Related
This is the website I'm trying to scrape with Python:
https://www.ebay.de/sch/i.html?_from=R40&_nkw=iphone+8&_sacat=0&LH_Sold=1&LH_Complete=1&rt=nc&LH_ItemCondition=3000
I want to access the 'ul' element with the class of 'srp-results srp-list clearfix'. This is what I tried with requests and BeautifulSoup:
from bs4 import BeautifulSoup
import requests
url = 'https://www.ebay.de/sch/i.html?_from=R40&_nkw=iphone+8&_sacat=0&LH_Sold=1&LH_Complete=1&rt=nc&LH_ItemCondition=3000'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
uls = soup.find_all('ul', attrs = {'class': 'srp-results srp-list clearfix'})
And the output is always an empty string.
I also tried scraping the website with Selenium Webdriver and I got the same result.
First I was a little bit confused about your error but after a bit of debugging I figured out that: eBay dynamically generates that ul with JavaScript
So since you can't execute JavaScript with BeautifulSoup you have to use selenium and wait until the JavaScript loads that ul
It is probably because the content you are looking for is rendered by JavaScript After the page loads on a web browser this means that the web browser load that content after running javascript which you cannot get with requests.get request from python.
I would suggest to learn Selenium to Scrape the data you want
I want to scrape information from this page.
Specifically, I want to scrape the table which appears when you click "View all" under the "TOP 10 HOLDINGS" (you have to scroll down on the page a bit).
I am new to webscraping, and have tried using BeautifulSoup to do this. However, there seems to be an issue because the "onclick" function I need to take into account. In other words: The HTML code I scrape directly from the page doesn't include the table I want to obtain.
I am a bit confused about my next step: should I use something like selenium or can I deal with the issue in an easier/more efficient way?
Thanks.
My current code:
from bs4 import BeautifulSoup
import requests
Soup = BeautifulSoup
my_url = 'http://www.etf.com/SHE'
page = requests.get(my_url)
htmltxt = page.text
soup = Soup(htmltxt, "html.parser")
print(soup)
You can get a json response from the api: http://www.etf.com/view_all/holdings/SHE. The table you're looking for is located in 'view_all'.
import requests
from bs4 import BeautifulSoup as Soup
url = 'http://www.etf.com/SHE'
api = "http://www.etf.com/view_all/holdings/SHE"
headers = {'X-Requested-With':'XMLHttpRequest', 'Referer':url}
page = requests.get(api, headers=headers)
htmltxt = page.json()['view_all']
soup = Soup(htmltxt, "html.parser")
data = [[td.text for td in tr.find_all('td')] for tr in soup.find_all('tr')]
print('\n'.join(': '.join(row) for row in data))
I am trying to scrape website, but I encountered a problem. When I try to scrape data, it looks like the html differs from what I see on google inspect and from what I get from python. I get this with http://edition.cnn.com/election/results/states/arizona/house/01 I tried to scrape election results. I used this script to check HTML part of the webpage, and I noticed that they different. There is no classes that I need, like section-wrapper.
page =requests.get('http://edition.cnn.com/election/results/states/arizona/house/01')
soup = BeautifulSoup(page.content, "lxml")
print(soup)
Anyone knows what is the problem ?
http://data.cnn.com/ELECTION/2016/AZ/county/H_d1_county.json
This site use JavaScript fetch data, you can check the url above.
You can find this url in chrome dev-tools, there are many links, check it out
Chrome >>F12>> network tab>>F5(refresh page)>>double click the .josn url>> open new tab
import requests
from bs4 import BeautifulSoup
page=requests.get('http://edition.cnn.com/election/results/states/arizona/house/01')
soup = BeautifulSoup(page.content)
#you can try all sorts of tags here I used class: "ad" and class:"ec-placeholder"
g_data = soup.find_all("div", {"class":"ec-placeholder"})
h_data = soup.find_all("div"),{"class":"ad"}
for item in g_data:print item
#print '\n'
#for item in h_data:print item
For an extra curricular school project, I'm learning how to scrape a website. As you can see by the code below, I am able to scrape a form called, 'elqFormRow' off of one page.
How would one go about scraping all occurrences of the 'elqFormRow' on the whole website? I'd like to return the URL of where that form was located into a list, but am running into trouble while doing so because I don't know how lol.
import bs4 as bs
import urllib.request
sauce = urllib.request.urlopen('http://engage.hpe.com/Template_NGN_Convert_EG-SW_Combined_TEALIUM-RegPage').read()
soup = bs.BeautifulSoup(sauce, 'lxml')
for div in soup.find_all('div', class_='elqFormRow'):
print(div.text.strip())
You can grab the URLs from a page and follow them to (presumably) scrape the whole site. Something like this, which will require a little massaging depending on where you want to start and what pages you want:
import bs4 as bs
import requests
domain = "engage.hpe.com"
initial_url = 'http://engage.hpe.com/Template_NGN_Convert_EG-SW_Combined_TEALIUM-RegPage'
# get urls to scrape
text = requests.get(initial_url).text
initial_soup = bs.BeautifulSoup(text, 'lxml')
tags = initial_soup.findAll('a', href=True)
urls = []
for tag in tags:
if domain in tag:
urls.append(tag['href'])
urls.append(initial_url)
print(urls)
# function to grab your info
def scrape_desired_info(url):
out = []
text = requests.get(url).text
soup = bs.BeautifulSoup(text, 'lxml')
for div in soup.find_all('div', class_='elqFormRow'):
out.append(div.text.strip())
return out
info = [scrape_desired_info(url) for url in urls if domain in url]
URLlib stinks, use requests. If you need to go multiple levels down in the site put the URL finding section in a function and call it X number of times, where X is the number of levels of links you want to traverse.
Scrape responsibly. Try not to get into a sorcerer's apprentice situation where you're hitting the site over and over in a loop, or following links external to the site. In general, I'd also not put in the question the page you want to scrape.
I am tring to work with web-Crawler in python to print the number of facebook recommenders. for example in this article from sky-news(http://news.sky.com/story/1330046/are-putins-little-green-men-back-in-ukraine)
there are about 60 facebook reccomends. I want to print this number in the python program with web-crawler.
i tried to do this, but it doesn't print anything:
import requests
from bs4 import BeautifulSoup
def get_single_item_data(item_url):
source_code = requests.get(item_url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text)
# if you want to gather information from that page
for item_name in soup.findAll('span', {'class': 'pluginCountTextDisconnected'}):
try:
print(item_name.string)
except:
print("error")
get_single_item_data("http://news.sky.com/story/1330046/are-putins-little-green-men-back-in-ukraine")
The Facebook recommends loads in an iframe. You can follow the iframe src attribute to that page, and then load the span.pluginCountTextDisconnected's text:
import requests
from bs4 import BeautifulSoup
url = 'http://news.sky.com/story/1330046/are-putins-little-green-men-back-in-ukraine'
r = requests.get(url) # get the page through requests
soup = BeautifulSoup(r.text) # create a BeautifulSoup object from the page's HTML
url = soup('iframe')[0]['src'] # search for the iframe element and get its src attribute
r = requests.get('http://' + url[2:]) # get the next page from requests with the iframe URL
soup = BeautifulSoup(r.text) # create another BeautifulSoup object
print(soup.find('span', class_='pluginCountTextDisconnected').string) # get the directed information
The second requests.get is written as such due to the src attribute returning //www.facebook.com/plugins/like.php?href=http%3A%2F%2Fnews.sky.com%2Fstory%2F1330046&send=false&layout=button_count&width=120&show_faces=false&action=recommend&colorscheme=light&font=arial&height=21. I added the http:// and ignored the leading //.
BeautifulSoup documentation
Requests documentation
Facebook recommends are loaded dynamically from javascript, so they won't be available to your HTML parser. You will need to use the Graph API and FQL to get your answer directly from Facebook.
Here is a web console where you can explore queries once you have generated an access token.