When I try to scrape data with the following code:
from bs4 import BeautifulSoup
import urllib2
import requests
import re
start_url = requests.get('http://www.indiaproperty.com/chennai-property-search-allresidential-properties-for-sale-in-velachery-results')
soup = BeautifulSoup(start_url.content)
properties = soup.findAll('a',{'class':'paddl10'})
for eachproperty in properties:
    print eachproperty['href']
I see no output, and no error is raised either. I checked whether the request was being redirected, using the following code:
from bs4 import BeautifulSoup
import requests
response = requests.get('http://www.indiaproperty.com/chennai-property-search-allresidential-properties-for-sale-in-velachery-results')
response.url
That also shows nothing, and raises no error either.
Related
I am trying to get the comments from a website called Seesaw but the output has no length. What am I doing wrong?
import requests
from bs4 import BeautifulSoup
html_text = requests.get("https://app.seesaw.me/#/activities/class/class.93a29acf-0eef-4d4e-9d56-9648d2623171").text
soup = BeautifulSoup(html_text, "lxml")
comments = soup.find_all("span", class_ = "ng-binding")
print(comments)
Because there is no span element with class ng-binding in the HTML that the server returns; those elements are added later by JavaScript. You can confirm this:
import requests
html_text = requests.get("https://app.seesaw.me/#/activities/class/class.93a29acf-0eef-4d4e-9d56-9648d2623171").text
print(f'{"ng-binding" in html_text=}')
The output is:
"ng-binding" in html_text=False
You can also check this with the "View Page Source" function in your browser. To get the rendered content, you can try using Selenium to automate interaction with the site.
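A related detail that may help explain the empty result: everything after the # in that URL is a fragment, which HTTP clients do not send to the server. The sketch below uses only the standard library to show which URL would actually be requested; the variable names are illustrative and no network request is made.

```python
from urllib.parse import urlsplit, urlunsplit

url = "https://app.seesaw.me/#/activities/class/class.93a29acf-0eef-4d4e-9d56-9648d2623171"

parts = urlsplit(url)

# The fragment stays on the client; it is normally read by the page's JavaScript.
fragment = parts.fragment

# What remains once the fragment is dropped is what the server is asked for.
requested = urlunsplit(parts._replace(fragment=""))

print(requested)
print(fragment)
```

So the server only ever sees the application's entry page, and the route inside the fragment is handled by JavaScript in the browser, which is consistent with the missing ng-binding elements.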
For some reason the terminal keeps returning nothing, not even None. I would appreciate some enlightenment on the matter.
import requests
import json
import time
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
def monitor():
    source = requests.get('https://www.topps.com')
    soup = BeautifulSoup(source.content, 'lxml')
    for hrefs in soup.find_all('figure', class_='product-item-link'):
        url = "https://www.topps.com/cards-collectibles/topps-now" + hrefs.a.get('href')
        print(url)

monitor()
I started learning web scraping in Python using urllib and bs4.
I was looking for some code to analyze and found this:
https://stackoverflow.com/a/38620894/14252018
Here is the code:
from urllib.parse import urlencode, urlparse, parse_qs
from lxml.html import fromstring
from requests import get
raw = get("https://www.google.com/search?q=StackOverflow").text
page = fromstring(raw)
for result in page.cssselect(".r a"):
    url = result.get("href")
    if url.startswith("/url?"):
        url = parse_qs(urlparse(url).query)['q']
    print(url[0])
When I try to run this, it does not print anything.
So then I tried bs4, and this time I chose https://www.duckduckgo.com
and changed the code to this:
import bs4 as bs
import urllib.request
sauce = urllib.request.urlopen('https://duckduckgo.com/?q=dinosaur&t=h_&ia=web').read()
soup = bs.BeautifulSoup(sauce, 'lxml')
print(soup.get_text())
I got an error.
Why didn't the first block of code run?
Why did the second block of code give me an error, and what does that error mean?
Change your DuckDuckGo URL to the one the site tries to redirect you to when JavaScript is not enabled.
import bs4 as bs
import urllib.request
# url = 'https://duckduckgo.com/?q=dinosaur&t=h_&ia=web' # uses javascript
url = 'https://html.duckduckgo.com/html?q=dinosaur' # no javascript
sauce = urllib.request.urlopen(url).read()
soup = bs.BeautifulSoup(sauce, 'lxml')
print(soup.get_text())
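If the search term varies, it is probably safer to build the query string with urlencode than to paste it into the URL by hand, since spaces and characters such as & need escaping. A minimal sketch, assuming the no-JavaScript endpoint shown above; build_search_url is a hypothetical helper name, and no request is made here.

```python
from urllib.parse import urlencode

# Endpoint taken from the answer above.
BASE_URL = "https://html.duckduckgo.com/html"


def build_search_url(query):
    """Return the no-JavaScript search URL with the query safely encoded."""
    return BASE_URL + "?" + urlencode({"q": query})


simple = build_search_url("dinosaur")
escaped = build_search_url("dinosaur fossils & eggs")

print(simple)
print(escaped)
```

The resulting string can be passed to urllib.request.urlopen exactly as in the code above.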
I have a web page and I want to get the <div class="password"> element using urllib2 in Python, without using Beautiful Soup.
My code so far:
import urllib.request as urllib2
link = "http://www.chiquitooenterprise.com/password"
response = urllib2.urlopen('http://www.chiquitooenterprise.com/')
contents = response.read('password')
It gives an error.
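read() takes an optional number of bytes, not a string, so call it with no arguments. You then need to decode() the result as UTF-8, the encoding shown for the response in the browser's Network tab.
Hence: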
import urllib.request as urllib2
link = "http://www.chiquitooenterprise.com/password"
response = urllib2.urlopen('http://www.chiquitooenterprise.com/')
output = response.read().decode('utf-8')
print(output)
OUTPUT:
YOIYEDGXPU
You say you don't want bs4, but you could use requests:
import requests
r = requests.get('http://www.chiquitooenterprise.com/password')
print(r.text)
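Since the question asks for the <div class="password"> element specifically, here is a rough sketch of pulling it out of the decoded text with the standard library's html.parser, without Beautiful Soup. The sample_html string is hypothetical markup standing in for the real response body, and PasswordDivParser is an illustrative name, not something taken from the site.

```python
from html.parser import HTMLParser


class PasswordDivParser(HTMLParser):
    """Collect the text found inside <div class="password"> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # greater than 0 while inside a matching div
        self.chunks = []  # text pieces collected from inside the match

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        # Enter on a matching div; keep counting nested divs once inside.
        if self.depth or "password" in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


# Hypothetical markup; in practice this would be response.read().decode('utf-8').
sample_html = (
    '<html><body><div class="password">YOIYEDGXPU</div>'
    '<div>other</div></body></html>'
)

parser = PasswordDivParser()
parser.feed(sample_html)
parser.close()

password = "".join(parser.chunks).strip()
print(password)
```

If the real page nests the element differently, the class check in handle_starttag is the part to adjust.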
I've been trying to get this to work, but I keep getting the same TypeError: object has no len(). The BeautifulSoup documentation hasn't been any help. This seems to work in every tutorial I watch or read, but not for me. What am I doing wrong?
import requests
from bs4 import BeautifulSoup
http = requests.get("https://www.imdb.com/title/tt0366627/?ref_=nv_sr_1")
print(http)
This prints <Response [200]>, but if I try to add soup, I get the len error:
import requests
from bs4 import BeautifulSoup
http = requests.get("https://www.imdb.com/title/tt0366627/?ref_=nv_sr_1")
soup = BeautifulSoup(http, 'lxml')
print(soup)
As the docs say:
To parse a document, pass it into the BeautifulSoup constructor. You can pass in a string or an open filehandle:
A Response object is neither a string nor an open filehandle.
The simplest way to get one of the two, as shown in the first example in the requests docs, is the .text attribute. So:
http = requests.get("https://www.imdb.com/title/tt6738136/?ref_=inth_ov_tt")
soup = BeautifulSoup(http.text, 'lxml')
For other options see Response Content. For example, you can get the bytes with .content to let BeautifulSoup guess at the encoding instead of reading it from the headers, or get the underlying file-like object (which is an open filehandle) with .raw, which requires passing stream=True to requests.get.
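The wording of the error suggests that len() is being called on whatever is passed in as markup, which works for a string but not for a response object. The sketch below reproduces that with a stand-in class; FakeResponse is hypothetical and only mimics the fact that a response carries a .text string without defining __len__. It is not the real requests.Response.

```python
class FakeResponse:
    """Hypothetical stand-in: has a .text attribute but no __len__."""

    def __init__(self, text):
        self.text = text


resp = FakeResponse("<html><title>Example</title></html>")

# Calling len() on the object itself fails, as in the question.
try:
    len(resp)
    error = None
except TypeError as exc:
    error = str(exc)

# Calling len() on the string it carries works fine.
text_length = len(resp.text)

print(error)
print(text_length)
```

Passing http.text (or http.content) instead of http hands the parser something it can measure and read.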
My final code. It just prints out the title, year and summary, which was all I wanted. Thank you all for your help.
import requests
import lxml
from bs4 import BeautifulSoup
http = requests.get("https://www.imdb.com/title/tt0366627/?ref_=nv_sr_1")
soup = BeautifulSoup(http.content, 'lxml')
title = soup.find("div", class_="title_wrapper").find()
summary = soup.find(class_="summary_text")
print(title.text)
print(summary.text)
The <Response [200]> that you are getting from the following code:
import requests
from bs4 import BeautifulSoup
http = requests.get("https://www.imdb.com/title/tt6738136/?ref_=inth_ov_tt")
print(http)
shows that your request succeeded and returned a response. There are two ways to work with the HTML:
Print the text (string) directly:
import requests
from bs4 import BeautifulSoup
http = requests.get("https://www.imdb.com/title/tt6738136/?ref_=inth_ov_tt")
print(http.text)
Use an HTML parser:
import requests
from bs4 import BeautifulSoup
http = requests.get("https://www.imdb.com/title/tt6738136/?ref_=inth_ov_tt")
soup = BeautifulSoup(http.text, 'lxml')
print(soup)
It is better to use BeautifulSoup, because it lets you extract the data you need from the HTML.