I tried using Beautiful Soup to parse a website, but when I printed page_soup I only got a portion of the HTML; the beginning portion, which has the info I need, was omitted. No one answered my question. After doing some research I tried using Selenium to access the full HTML, but I got the same result. Below are both of my attempts, with Selenium and with Beautiful Soup. When I try to print the HTML it starts in the middle of the source code, skipping the doctype, lang, etc. initial statements.
from selenium import webdriver
from bs4 import BeautifulSoup
browser = webdriver.Chrome(executable_path="/usr/local/bin/chromedriver")
browser.get('https://coronavirusbellcurve.com/')
html = browser.page_source
soup = BeautifulSoup(html, 'html.parser')
print(soup)
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup
# build the request for the same page; a browser-like User-Agent helps avoid a 403
pageRequest = Request('https://coronavirusbellcurve.com/', headers={'User-Agent': 'Mozilla/5.0'})
htmlPage = urlopen(pageRequest).read()
page_soup = soup(htmlPage, 'html.parser')
print(page_soup)
The requests module seems to be returning the numbers in the first table on the page, assuming you are referring to US Totals.
import requests
r = requests.get('https://coronavirusbellcurve.com/').content
print(r)
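If that table is present in the raw HTML rather than injected by JavaScript, a minimal sketch for pulling its rows out with BeautifulSoup might look like this (the assumption that the US totals sit in the first table element is exactly that, an assumption):
import requests
from bs4 import BeautifulSoup
html = requests.get('https://coronavirusbellcurve.com/').content
soup = BeautifulSoup(html, 'html.parser')
# assume the US totals are in the first <table> of the raw HTML
table = soup.find('table')
if table is not None:
    for row in table.find_all('tr'):
        # print the text of every header/data cell in the row
        print([cell.get_text(strip=True) for cell in row.find_all(['th', 'td'])])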
I'm trying to scrape the link to the image on this Reddit post for practice, but BS4 seems to return None whenever I use find() to look the element up by class. Any help?
import requests
from bs4 import BeautifulSoup as soup
page = requests.get("https://www.reddit.com/r/wallpaper/comments/qswblq/the_frontier_by_me_5120x2880/")
soup = soup(page.content, "html.parser")
print(soup.find(class_="ImageBox-image")['src'])
As mentioned in the comments there is an alternative: you can use Selenium.
Instead of requests it will render the site like a browser and give you page_source, which you can inspect to find your element.
Example:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome('YOUR PATH TO CHROMEDRIVER')
driver.get('https://www.reddit.com/r/wallpaper/comments/qswblq/the_frontier_by_me_5120x2880/')
content = driver.page_source
soup = BeautifulSoup(content,'html.parser')
print(soup.find(class_="ImageBox-image")['src'])
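If the element is rendered a moment after the page loads, page_source can still be read too early; a variation with an explicit wait is sketched below (the 10-second timeout and the class-based wait condition are assumptions):
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome('YOUR PATH TO CHROMEDRIVER')
driver.get('https://www.reddit.com/r/wallpaper/comments/qswblq/the_frontier_by_me_5120x2880/')
# wait up to 10 seconds for the image element to show up before grabbing page_source
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, 'ImageBox-image')))
soup = BeautifulSoup(driver.page_source, 'html.parser')
print(soup.find(class_="ImageBox-image")['src'])
driver.quit()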
I am trying to get the comments from a website called Seesaw but the output has no length. What am I doing wrong?
import requests
from bs4 import BeautifulSoup
html_text = requests.get("https://app.seesaw.me/#/activities/class/class.93a29acf-0eef-4d4e-9d56-9648d2623171").text
soup = BeautifulSoup(html_text, "lxml")
comments = soup.find_all("span", class_ = "ng-binding")
print(comments)
Because there is no span element with class ng-binding in the page source (these elements are added later via JavaScript):
import requests
html_text = requests.get("https://app.seesaw.me/#/activities/class/class.93a29acf-0eef-4d4e-9d56-9648d2623171").text
print(f'{"ng-binding" in html_text=}')
So output is:
"ng-binding" in html_text=False
Also you can check it using the "View Page Source" function in your browser. You can try to use Selenium to automate interaction with the site.
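A rough sketch of the Selenium route, assuming the comments really do end up in span.ng-binding elements once the JavaScript has run (and that the page is reachable without logging in):
from bs4 import BeautifulSoup
from selenium import webdriver
import time
driver = webdriver.Chrome()
driver.get("https://app.seesaw.me/#/activities/class/class.93a29acf-0eef-4d4e-9d56-9648d2623171")
time.sleep(10)  # crude pause so the Angular app has time to render; an explicit wait would be cleaner
html_text = driver.page_source
driver.quit()
soup = BeautifulSoup(html_text, "lxml")
comments = soup.find_all("span", class_="ng-binding")
print(comments)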
So I started learning web scraping in Python using urllib and bs4.
I was searching for some code to analyze and I found this:
https://stackoverflow.com/a/38620894/14252018
Here is the code:
from urllib.parse import urlencode, urlparse, parse_qs
from lxml.html import fromstring
from requests import get
raw = get("https://www.google.com/search?q=StackOverflow").text
page = fromstring(raw)
for result in page.cssselect(".r a"):
    url = result.get("href")
    if url.startswith("/url?"):
        url = parse_qs(urlparse(url).query)['q']
        print(url[0])
When I try to run this it does not print anything.
So then I tried using bs4, and this time I chose https://www.duckduckgo.com
and changed the code to this:
import bs4 as bs
import urllib.request
sauce = urllib.request.urlopen('https://duckduckgo.com/?q=dinosaur&t=h_&ia=web').read()
soup = bs.BeautifulSoup(sauce, 'lxml')
print(soup.get_text())
I got an error.
Why didn't the first block of code run?
Why did the second block of code give me an error, and what does that error mean?
Change your DuckDuckGo URL to the one the site redirects you to when JavaScript is not enabled.
import bs4 as bs
import urllib.request
# url = 'https://duckduckgo.com/?q=dinosaur&t=h_&ia=web' # uses javascript
url = 'https://html.duckduckgo.com/html?q=dinosaur' # no javascript
sauce = urllib.request.urlopen(url).read()
soup = bs.BeautifulSoup(sauce, 'lxml')
print(soup.get_text())
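Building on that, here is a small sketch for pulling just the result links out of the no-JavaScript page instead of dumping all of its text; the result__a class name is an assumption about the HTML version's markup, so inspect the page and adjust if needed:
import bs4 as bs
import urllib.request
url = 'https://html.duckduckgo.com/html?q=dinosaur'
sauce = urllib.request.urlopen(url).read()
soup = bs.BeautifulSoup(sauce, 'lxml')
# 'result__a' is assumed to be the class on each result link
for link in soup.find_all('a', class_='result__a'):
    print(link.get_text(strip=True), link.get('href'))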
I want to scrape the Google translate website and get the translated text from it using Python 3.
Here is my code:
from bs4 import BeautifulSoup as soup
from urllib.request import Request as uReq
from urllib.request import urlopen as open
my_url = "https://translate.google.com/#en/es/I%20am%20Animikh%20Aich"
req = uReq(my_url, headers={'User-Agent':'Mozilla/5.0'})
uClient = open(req)
page_html = uClient.read()
uClient.close()
html = soup(page_html, 'html5lib')
print(html)
Unfortunately, I am unable to find the required information in the parsed webpage.
In Chrome's "Inspect" view, it shows that the translated text is inside:
<span id="result_box" class="short_text" lang="es"><span class="">Yo soy Animikh Aich</span></span>
However, when I search for the information in the parsed HTML, this is what I find:
<span class="short_text" id="result_box"></span>
I have tried parsing with html5lib, lxml, and html.parser, but I have not been able to find a solution.
Please help me with this issue.
You could use a dedicated Python API for this:
import goslate
gs = goslate.Goslate()
print(gs.translate('I am Animikh Aich', 'es'))
Output:
Yo soy Animikh Aich
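(goslate is a third-party package, so it has to be installed first, typically with pip install goslate.)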
Try the following to get the desired content:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://translate.google.com/#en/es/I%20am%20Animikh%20Aich")
soup = BeautifulSoup(driver.page_source, 'html5lib')
item = soup.select_one("#result_box span").text
print(item)
driver.quit()
Output:
Yo soy Animikh Aich
JavaScript is modifying the HTML code after it loads. urllib can't handle JavaScript, so you'll have to use Selenium to get the data you want.
For installation and a demo, refer to this link.
I'm trying to capture the number of visits on this page, but Python returns the tag with no text.
This is what I've done.
import requests
from bs4 import BeautifulSoup
r = requests.get("http://www.kijiji.ca/v-2-bedroom-apartments-condos/city-of-halifax/clayton-park-west-condo-style-luxury-2-bed-den/1016364514")
soup = BeautifulSoup(r.content, 'html.parser')
print(soup.find_all("span", {"class": "ad-visits"}))
The values you are trying to scrape are populated by JavaScript, so BeautifulSoup or requests aren't going to work in this case.
You'll need to use something like Selenium to get the output.
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("http://www.kijiji.ca/v-2-bedroom-apartments-condos/city-of-halifax/clayton-park-west-condo-style-luxury-2-bed-den/1016364514")
soup = BeautifulSoup(driver.page_source , 'html.parser')
print(soup.find_all("span", {"class": "ad-visits"}))
Selenium will return the page source as rendered, and you can then use BeautifulSoup to get the value:
[<span class="ad-visits">385</span>]
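As a small follow-up, once the rendered source is in BeautifulSoup you can pull out just the visit count text rather than the whole tag list; a sketch, guarding against the element being missing:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("http://www.kijiji.ca/v-2-bedroom-apartments-condos/city-of-halifax/clayton-park-west-condo-style-luxury-2-bed-den/1016364514")
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()
visits = soup.find("span", {"class": "ad-visits"})
# print only the text of the span, e.g. 385, if it was found
print(visits.get_text(strip=True) if visits else "ad-visits span not found")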