I started learning web scraping in Python using urllib and bs4.
I was looking for some code to analyze and found this answer:
https://stackoverflow.com/a/38620894/14252018
Here is the code:
from urllib.parse import urlencode, urlparse, parse_qs
from lxml.html import fromstring
from requests import get
raw = get("https://www.google.com/search?q=StackOverflow").text
page = fromstring(raw)
for result in page.cssselect(".r a"):
    url = result.get("href")
    if url.startswith("/url?"):
        url = parse_qs(urlparse(url).query)['q']
        print(url[0])
When I run this, it does not print anything.
So I then tried bs4, this time against https://www.duckduckgo.com,
and changed the code to this:
import bs4 as bs
import urllib.request
sauce = urllib.request.urlopen('https://duckduckgo.com/?q=dinosaur&t=h_&ia=web').read()
soup = bs.BeautifulSoup(sauce, 'lxml')
print(soup.get_text())
I got an error.
Why didn't the first block of code print anything?
Why did the second block of code give me an error, and what does that error mean?
Change your DuckDuckGo URL to the one the site redirects you to when JavaScript is not enabled:
import bs4 as bs
import urllib.request
# url = 'https://duckduckgo.com/?q=dinosaur&t=h_&ia=web' # uses javascript
url = 'https://html.duckduckgo.com/html?q=dinosaur' # no javascript
sauce = urllib.request.urlopen(url).read()
soup = bs.BeautifulSoup(sauce, 'lxml')
print(soup.get_text())
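If the search term changes, the no-JavaScript URL can be built rather than typed by hand. The following is a minimal sketch, assuming the `html.duckduckgo.com/html` endpoint from the answer above keeps accepting a `q` parameter; `build_search_url` and `NO_JS_ENDPOINT` are hypothetical names used only for illustration. It makes no network request.

```python
from urllib.parse import urlencode

# Endpoint taken from the answer above; assumed to accept a "q" parameter.
NO_JS_ENDPOINT = "https://html.duckduckgo.com/html"


def build_search_url(query):
    """Return the no-JavaScript search URL for `query` (hypothetical helper)."""
    # urlencode escapes spaces and reserved characters such as "&".
    return NO_JS_ENDPOINT + "?" + urlencode({"q": query})


print(build_search_url("dinosaur"))
print(build_search_url("dinosaur fossils & eggs"))
```

The resulting string can be passed to urllib.request.urlopen() exactly as in the snippet above.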
Related
I am trying to get the comments from a website called Seesaw but the output has no length. What am I doing wrong?
import requests
from bs4 import BeautifulSoup
html_text = requests.get("https://app.seesaw.me/#/activities/class/class.93a29acf-0eef-4d4e-9d56-9648d2623171").text
soup = BeautifulSoup(html_text, "lxml")
comments = soup.find_all("span", class_ = "ng-binding")
print(comments)
Because there is no span element with class ng-binding in the HTML that the server returns (these elements are added later via JavaScript):
import requests
html_text = requests.get("https://app.seesaw.me/#/activities/class/class.93a29acf-0eef-4d4e-9d56-9648d2623171").text
print(f'{"ng-binding" in html_text=}')
So the output is:
"ng-binding" in html_text=False
You can also check this with the "View Page Source" function in your browser. You can try using Selenium to automate interaction with the site.
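A substring test can also match text inside a script, so a slightly stricter check is to collect the class names that actually appear on tags in the static HTML. This is a rough sketch using only the standard library; `ClassCollector`, `classes_in`, and the `static_shell` sample markup are made up for illustration and are not Seesaw's real page.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect every class name found on any start tag."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def classes_in(html_text):
    """Return the set of class names present in the given HTML text."""
    collector = ClassCollector()
    collector.feed(html_text)
    return collector.classes


# Hypothetical static shell of a JavaScript-driven page.
static_shell = (
    '<html><body><div id="app" class="container"></div>'
    '<script src="app.js"></script></body></html>'
)
print("ng-binding" in classes_in(static_shell))
```

With real pages, pass the text returned by requests.get(...).text to classes_in() instead of the sample string.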
I tried using Beautiful Soup to parse a website, but when I printed "page_soup" I only got a portion of the HTML; the beginning of the page, which has the info I need, was omitted. No one answered my question. After some research I tried using Selenium to access the full HTML, but I got the same result. Below are both of my attempts, with Selenium and with Beautiful Soup. When I print the HTML, it starts in the middle of the source code, skipping the doctype, lang, and other initial declarations.
from selenium import webdriver
from bs4 import BeautifulSoup
browser = webdriver.Chrome( executable_path= "/usr/local/bin/chromedriver")
browser.get('https://coronavirusbellcurve.com/')
html = browser.page_source
soup = BeautifulSoup(html)
print(soup)
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup
# pageRequest was not defined in the post; a plain Request for the same URL is assumed here
pageRequest = Request('https://coronavirusbellcurve.com/')
htmlPage = urlopen(pageRequest).read()
page_soup = soup(htmlPage, 'html.parser')
print(page_soup)
The requests module does seem to return the numbers in the first table on the page, assuming you are referring to US Totals:
import requests
r = requests.get('https://coronavirusbellcurve.com/').content
print(r)
I am trying to get some data from a URL using BeautifulSoup in Python, but when I run the last command,
soup = BeautifulSoup(content)
I consistently get an error telling me that 'LXMLTreeBuilder' object has no attribute 'DEFAULT_NSMAPS_INVERTED'.
How do I fix this problem?
Here is my code:
import urllib.request as urllib2
from bs4 import BeautifulSoup
import requests
url = 'https://www.ucf.edu/'
content = urllib2.urlopen(url).read()
soup = BeautifulSoup(content)
You imported requests, so use it. Try it this way:
url = 'https://www.ucf.edu/'
page = requests.get(url)
soup = BeautifulSoup(page.content)
You don't specify a parser in the BeautifulSoup constructor. Try passing html.parser there:
import urllib.request as urllib2
from bs4 import BeautifulSoup
import requests
url = 'https://www.ucf.edu/'
content = urllib2.urlopen(url).read()
soup = BeautifulSoup(content, 'html.parser') # <-- specify parser here
print(soup.prettify())
EDIT: Make sure you have the latest version of BeautifulSoup installed (and, optionally, the latest version of lxml). I'm on beautifulsoup4==4.8.0 and lxml==4.3.4.
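To see which versions are actually installed before upgrading, something like the following may help. It is a small sketch using the standard-library importlib.metadata module (Python 3.8+); `installed_version` is a hypothetical helper name, and the package names are the distribution names mentioned in the edit above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


for package in ("beautifulsoup4", "lxml"):
    print(package, installed_version(package) or "not installed")
```

If either package turns out to be old, upgrading both together is probably the simplest thing to try, since the error involves the lxml tree builder inside BeautifulSoup.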
I have a web page and I want to get the <div class="password"> element using urllib2 in Python, without using Beautiful Soup.
My code so far:
import urllib.request as urllib2
link = "http://www.chiquitooenterprise.com/password"
response = urllib2.urlopen('http://www.chiquitooenterprise.com/')
contents = response.read('password')
It gives an error.
You need to decode() the response as UTF-8, the encoding reported for it in the browser's Network tab.
Hence:
import urllib.request as urllib2
link = "http://www.chiquitooenterprise.com/password"
response = urllib2.urlopen(link)  # request the /password page rather than the site root
output = response.read().decode('utf-8')  # read() returns bytes; decode them to str
print(output)
OUTPUT:
YOIYEDGXPU
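Rather than hard-coding utf-8, the charset can usually be read from the response headers. The sketch below assumes the server sends a Content-Type header with a charset parameter; `decode_body` is a hypothetical helper, and the headers are built by hand here so the example runs without a network connection. With urlopen(), `response.headers` is an http.client.HTTPMessage, which provides the same get_content_charset() method.

```python
from email.message import Message


def decode_body(body, headers, default="utf-8"):
    """Decode `body` bytes using the charset declared in `headers`, if any."""
    # get_content_charset() returns None when no charset is declared.
    charset = headers.get_content_charset() or default
    return body.decode(charset)


# Hand-built stand-in for response.headers.
headers = Message()
headers["Content-Type"] = "text/html; charset=utf-8"
print(decode_body(b"YOIYEDGXPU", headers))
```

In real code this would be called as decode_body(response.read(), response.headers).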
You say you don't want bs4, but you could use requests:
import requests
r = requests.get('http://www.chiquitooenterprise.com/password')
print(r.text)
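If the page does wrap the value in a <div class="password"> element, as the question describes, it can be picked out with the standard library alone. This is only a sketch: `DivTextExtractor` is a hypothetical class, and the `sample` markup is invented for illustration, since the real page's HTML is not shown here.

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collect the text inside every <div> whose class list contains `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.depth = 0      # greater than 0 while inside a matching div
        self.chunks = []    # text pieces of the div currently being read
        self.results = []   # finished text of each matching div

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1  # nested div inside a matching one
        elif self.wanted in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.chunks).strip())
                self.chunks = []

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


# Invented sample markup standing in for the downloaded page.
sample = '<html><body><div class="password">YOIYEDGXPU</div></body></html>'
parser = DivTextExtractor("password")
parser.feed(sample)
print(parser.results)
```

To use it on the real page, feed it the decoded text from response.read().decode('utf-8') or r.text instead of the sample string.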
I am working in Python and learning BeautifulSoup by parsing a page.
My URL:
http://www.dtemaharashtra.gov.in/approvedinstitues/StaticPages/frmInstituteSummary.aspx?InstituteCode=1002
I want to extract the email ID from that page.
How can I do that?
from urllib.request import urlopen  # Python 3 replacement for urllib2
from bs4 import BeautifulSoup
html = urlopen('http://www.dtemaharashtra.gov.in/approvedinstitues/StaticPages/frmInstituteSummary.aspx?InstituteCode=1002').read()
soup = BeautifulSoup(html, 'html.parser')
# Look the element up by its id attribute and print its text
print(soup.find(id='ctl00_rightContainer_ContentBox1_lblEMailAddress').text)
import requests
from bs4 import BeautifulSoup
r = requests.get("http://www.dtemaharashtra.gov.in/approvedinstitues/StaticPages/frmInstituteSummary.aspx?InstituteCode=1002")
soup = BeautifulSoup(r.text, 'html.parser')
print(soup.find("span", {"id":"ctl00_rightContainer_ContentBox1_lblEMailAddress"}).text)