Web scraping using BeautifulSoup - Python

I am trying to scrape reviews from IMDb movies using Python 3.6. However, when I print my 'review', only one review pops up and I am not sure why the rest do not. This does not happen with my 'review_title'. Any advice or help is greatly appreciated, as I've been searching forums and googling to no avail.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

url = urlopen('http://www.imdb.com/title/tt0111161/reviews?ref_=tt_ov_rt').read()
soup = BeautifulSoup(url, "html.parser")
print(soup.prettify())
review_title = soup.find("div", attrs={"class": "lister"}).findAll("div", {"class": "title"})
review = soup.find("div", attrs={"class": "text"})
review = soup.find("div", attrs={"class": "text"}).findAll("div", {"class": "text"})
rating = soup.find("span", attrs={"class": "rating-other-user-rating"}).findAll("span")

Without creating any loop, how can you reach all the content of that page? Your script is doing exactly what it is written to do: parsing a single review's content. Try the approach below instead; it will fetch you all the visible data.
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = urlopen('http://www.imdb.com/title/tt0111161/reviews?ref_=tt_ov_rt').read()
soup = BeautifulSoup(url,"html.parser")
for item in soup.find_all(class_="review-container"):
    review_title = item.find(class_="title").text
    review = item.find(class_="text").text
    try:
        rating = item.find(class_="point-scale").previous_sibling.text
    except AttributeError:  # reviews without a rating have no point-scale element
        rating = ""
    print("Title: {}\nReview: {}\nRating: {}\n".format(review_title, review, rating))

Related

Trying to get the string inside of <div> tags using bs4 (python3)

Please be patient with me; I'm brand new to Python and Stack Overflow.
I am trying to pull crypto price data into a program in order to find out exactly how much I have in USD. I am currently stuck trying to extract the string from the tag that I get back. What I have so far:
It won't let me add a picture to my post yet, so here is a link to it: https://i.stack.imgur.com/DVlxe.png
I will also put the code here; please forgive the formatting.
from bs4 import BeautifulSoup
import requests

url = 'https://coinmarketcap.com/currencies/shiba-inu/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
price = soup.find_all("div", {"class": "imn55z-0 hCqbVS price"})
for i in price:
    prices = i.find("div")
    print(prices)
I want to pull the string out and turn it into a number (a float, since the price is fractional) to do some math on later in the program.
Any and all help will be much appreciated.
You don't need the whole class (which might change); it should work with just price. Try the following:
from bs4 import BeautifulSoup
import requests
url = 'https://coinmarketcap.com/currencies/shiba-inu/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
div_price = soup.find('div', class_="price")
price = div_price.div.text
print(price)
This displayed:
$0.00001019
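If you then want to do math on it, note that a value like $0.00001019 needs a float rather than an int. A minimal sketch for the conversion, continuing from the price string above (the holding amount is a made-up example):

raw = price  # the "$0.00001019" string scraped above
value = float(raw.replace("$", "").replace(",", ""))  # strip currency symbol and thousands separators
holdings = 1_000_000  # hypothetical number of coins held
print(value * holdings)  # USD value of the holdings, e.g. 10.19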

BeautifulSoup returns None when using find() for an element

I'm trying to scrape this site to retrieve the year of each paper that has been published. I've managed to get the titles to work, but when it comes to scraping the years it returns None.
I've broken it down, and the None results occur when it goes into the for loop, but I can't figure out why this happens when it worked for the titles.
import requests
from bs4 import BeautifulSoup

URL = "https://dblp.org/search?q=web%20scraping"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(class_="publ-list")
paperResults = results.find_all(class_="data tts-content")
for singlepaper in paperResults:
    paperyear = singlepaper.find(class_="datePublished")
    print(paperyear)
When I print paperResults, it shows the breakdown of the section I selected on the line above, so the selection itself works.
Any suggestions on how to retrieve the years would be greatly appreciated.
Change this
for singlepaper in paperResults:
    paperyear = singlepaper.find(class_="datePublished")
    print(paperyear)
To this
for singlepaper in paperResults:
    paperyear = singlepaper.find('span', itemprop="datePublished")
    print(paperyear.string)
You were looking for a class when you needed to match the span's attribute: if you print paperResults you will see that datePublished is an itemprop attribute on a span element, not a class.
Try this:
import requests
from bs4 import BeautifulSoup

URL = "https://dblp.org/search?q=web%20scraping"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(class_="publ-list")
paperResults = results.find_all(class_="data tts-content")
for singlepaper in paperResults:
    paperyear = singlepaper.find(attrs={"itemprop": "datePublished"})
    print(paperyear)
It worked for me.
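Either way, what gets printed is the whole <span> element. To work with the year as a number (assuming the span contains a bare year like 2021), extract the text first; a small follow-up sketch:

for singlepaper in paperResults:
    paperyear = singlepaper.find(attrs={"itemprop": "datePublished"})
    if paperyear is not None:            # guard: some entries may lack a year
        year = int(paperyear.get_text())
        print(year)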

How to scrape next-page data as I do on the first page?

I have the following code:
from bs4 import BeautifulSoup
import requests
import csv

url = "https://coingecko.com/en"
base_url = "https://coingecko.com"
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")
names = [div.a.span.text for div in soup.find_all("div", attrs={"class": "coin-content center"})]
Link = [base_url + div.a["href"] for div in soup.find_all("div", attrs={"class": "coin-content center"})]
for link in Link:
    inner_page = requests.get(link)
    inner_soup = BeautifulSoup(inner_page.content, "html.parser")
    indent = inner_soup.find("div", attrs={"class": "py-2"})
    content = indent.div.next_siblings
    Allcontent = [sibling for sibling in content if sibling.string is not None]
    print(Allcontent)
I have successfully entered the inner pages and grabbed the information for every coin listed on the first page. But there are further pages: 1, 2, 3, 4, 5, 6, 7, 8, 9, etc. How can I go to all the next pages and do the same as before?
Also, the output of my code contains a lot of \n characters and extra whitespace. How can I fix that?
You need to generate the page URLs, request them one by one, and parse each with bs4:
from bs4 import BeautifulSoup
import requests

req = requests.get('https://www.coingecko.com/en')
soup = BeautifulSoup(req.content, 'html.parser')
# the last pagination link holds the highest page number
last_page = soup.select('ul.pagination li:nth-of-type(8) > a:nth-of-type(1)')[0]['href']
lp = last_page.split('=')[-1]
for count in range(1, int(lp) + 1):
    url = 'https://www.coingecko.com/en?page=' + str(count)
    print(url)
    requests.get(url)  # request each page one by one till the last page
    # parse your fields here using bs4
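To fill in that parsing step, the selector from the question could be reused on each page's soup. A sketch of the loop body, assuming CoinGecko's coin-content center markup from the question still matches the live site:

# Inside the loop above, instead of the bare requests.get(url):
page = requests.get(url)
page_soup = BeautifulSoup(page.content, 'html.parser')
names = [div.a.span.text for div in page_soup.find_all("div", attrs={"class": "coin-content center"})]
print(names)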
The way you have written your script gives it a messy look. Try .select() to make it concise and less prone to breakage. Although I could not find any further usage of names in your script, I kept it as it is. Here is how you can get all the available links while traversing multiple pages:
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import requests

url = "https://coingecko.com/en"
while True:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, "lxml")
    names = [item.text for item in soup.select("span.d-lg-block")]
    for link in [urljoin(url, item["href"]) for item in soup.select(".coin-content a")]:
        inner_page = requests.get(link)
        inner_soup = BeautifulSoup(inner_page.text, "lxml")
        desc = [item.get_text(strip=True) for item in inner_soup.select(".py-2 p") if item.text]
        print(desc)
    try:
        url = urljoin(url, soup.select_one(".pagination a[rel='next']")['href'])
    except TypeError:
        break
Btw, the whitespace has also been taken care of by using .get_text(strip=True).
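For a quick, self-contained illustration of what .get_text(strip=True) does compared to plain .text:

from bs4 import BeautifulSoup

snippet = BeautifulSoup("<p>\n   Bitcoin is a cryptocurrency.  \n</p>", "html.parser")
print(repr(snippet.p.text))                  # '\n   Bitcoin is a cryptocurrency.  \n'
print(repr(snippet.p.get_text(strip=True)))  # 'Bitcoin is a cryptocurrency.'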

How to use Python to press the “load more” button on IMDb to get more reviews

I am creating a web crawler and I want to scrape the user reviews from IMDb. It's easy to directly get the first 10 reviews and ratings from the original page, for example http://www.imdb.com/title/tt1392170/reviews. The problem is that to get all the reviews I need to press "load more" so that more reviews are shown, while the URL doesn't change. So I don't know how I can get all the reviews in Python 3. What I use now are requests and bs4.
My code now:
from urllib.request import urlopen, urlretrieve
from bs4 import BeautifulSoup

url_link = 'http://www.imdb.com/title/tt0371746/reviews?ref_=tt_urv'
html = urlopen(url_link)
content_bs = BeautifulSoup(html, "html.parser")  # explicit parser avoids the bs4 warning
for b in content_bs.find_all('div', class_='text'):
    print(b)
for rate_score in content_bs.find_all('span', class_='rating-other-user-rating'):
    print(rate_score)
You can't press the load more button without triggering a click event, and BeautifulSoup has no such capability. However, what you can do to get the full content is something like what I've demonstrated below. It will fetch you all the review titles along with the reviews:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = 'http://www.imdb.com/title/tt0371746/reviews?ref_=tt_urv'
res = requests.get(url)
soup = BeautifulSoup(res.text, "lxml")
# extract the link leading to the page containing everything available here
main_content = urljoin(url, soup.select(".load-more-data")[0]['data-ajaxurl'])
response = requests.get(main_content)
broth = BeautifulSoup(response.text, "lxml")
for item in broth.select(".review-container"):
    title = item.select(".title")[0].text
    review = item.select(".text")[0].text
    print("Title: {}\n\nReview: {}\n\n".format(title, review))

Scrape info of companies on Fortune 500

I am trying to scrape company info from http://fortune.com/fortune500 for my thesis. When I downloaded the page text from that link, there were no links to parse. However, opening the link in Chrome automatically leads to the #1 company's page.
Could someone kindly explain what happened here and how I can trace the links to the company pages from the original URL?
First you need to get the postid, then make a request to /data/franchise-list, and then get the URL from the first article:
import json
import re
from urllib.request import urlopen
from urllib.parse import urljoin
from bs4 import BeautifulSoup

data = urlopen('http://fortune.com/fortune500/')
soup = BeautifulSoup(data, "html.parser")
# the body's class list carries a postid-NNN entry identifying the page
postid = next(attr for attr in soup.body['class'] if attr.startswith('postid'))
postid = re.match(r'postid-(\d+)', postid).group(1)
url = "http://fortune.com/data/franchise-list/{postid}/1/".format(postid=postid)
data = json.load(urlopen(url))
resulting_url = urljoin(url, data['articles'][0]['url'])
print(resulting_url)
Prints:
http://fortune.com/fortune500/wal-mart-stores-inc-1/
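The same JSON response contains every article on that page, not just the first, so all the company links can be collected by looping over data['articles'] (and presumably by incrementing the trailing /1/ page number in the URL for the rest of the list):

# Continuing from the `data` and `url` variables above:
for article in data['articles']:
    print(urljoin(url, article['url']))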
