How to webscrape reviews from external links with bs4? - python

I would like to extract for each movie at least 20 user reviews, but I don't know how to loop to get into the IMDb title movie and then to the user reviews with beautifulsoup.
start link = "https://www.imdb.com/search/title/?title_type=feature,tv_movie&release_date=2018-01-01,2019-12-31&count=250";
title_link(1) = "https://www.imdb.com/title/tt7131622/?ref_=adv_li_tt";
user_reviews_link_movie1 = "https://www.imdb.com/title/tt7131622/reviews?ref_=tt_ov_rt" ;
I am able to extract from a static page titles, years, ratings and metascores of each movie of the list.
# Import packages and set urls
from requests import get
url = 'https://www.imdb.com/search/title/?title_type=feature,tv_movie&release_date=2018-01-01,2019-12-31&count=250'
response = get(url)
print(response.text[:500])
from bs4 import BeautifulSoup
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)
movie_containers = html_soup.find_all('div', class_ = 'lister-item mode-advanced')
print(type(movie_containers))
print(len(movie_containers))
# Lists to store the scraped data in
names = []
years = []
imdb_ratings = []
metascores = []
votes = []
# Extract data from individual movie container
for container in movie_containers:
# If the movie has Metascore, then extract:
if container.find('div', class_ = 'ratings-metascore') is not None:
# The name
name = container.h3.a.text
names.append(name)
# The year
year = container.h3.find('span', class_ = 'lister-item-year').text
years.append(year)
# The IMDB rating
imdb = float(container.strong.text)
imdb_ratings.append(imdb)
# The Metascore
m_score = container.find('span', class_ = 'metascore').text
metascores.append(int(m_score))
import pandas as pd
test_df = pd.DataFrame({'movie': names,'year': years,'imdb': imdb_ratings,'metascore': metascores})
test_df
Actual results :
movie year imdb metascore
Once Upon a Time... in Hollywood (2019) (8.1) (83)
Scary Stories (2019) (6.5) (61)
Fast & Furious: Hobbs & Shaw (2019) (6.8) (60)
Avengers: Endgame (2019) (8.6) (78)
Expected :
movie1 year1 imbd1 metascore1 review1
movie1 year1 imbd1 metascore1 review2
...
movie1 year1 imbd1 metascore1 review20
movie2 year2 imbd2 metascore2 review1
...
movie2 year2 imbd2 metascore2 review20
...
movie250 year250 imbd250 metascore250 review20

Assuming that answer on my question in comments is "yes".
Below is a solution to your initial request.
There's a check whether a particular film really has 20 reviews. If less, then gather all available ones.
Technically parsing process is correct, I checked it when assigned movie_containers = movie_containers[:3]. Gathering all data will take some time.
UPDATE: just finished collecting info on all 250 films - everything is scraped without errors, so block after solution itself is just FYI.
Also if you want to go further with your parsing, I mean collect data for next 250 films and so on, you can add one more looping level to this parser. The process is similar to one in the "Reviews extracting" section.
# Import packages and set urls
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
base_url = 'https://www.imdb.com/search/title/?title_type=feature,tv_movie&release_date=2018-01-01,2019-12-31&count=250'
url_header_for_reviews = 'https://www.imdb.com'
url_tail_for_reviews = 'reviews?ref_=tt_urv'
base_response = get(base_url)
html_soup = BeautifulSoup(base_response.text, 'html.parser')
movie_containers = html_soup.find_all('div', class_ = 'lister-item mode-advanced')
result_df = pd.DataFrame()
# Extract data from individual movie container
for container in movie_containers:
# If the movie has Metascore, then extract:
if container.find('div', class_ = 'ratings-metascore') is not None:
# Reviews extracting
num_reviews = 20
# Getting last piece of link puzzle for a movie reviews` link
url_middle_for_reviews = container.find('a')['href']
# Opening reviews page of a concrete movie
response_reviews = get(url_header_for_reviews + url_middle_for_reviews + url_tail_for_reviews)
reviews_soup = BeautifulSoup(response_reviews.text, 'html.parser')
# Searching all reviews
reviews_containers = reviews_soup.find_all('div', class_ = 'imdb-user-review')
# Check if actual number of reviews is less than target one
if len(reviews_containers) < num_reviews:
num_reviews = len(reviews_containers)
# Looping through each review and extracting title and body
reviews_titles = []
reviews_bodies = []
for review_index in range(num_reviews):
review_container = reviews_containers[review_index]
review_title = review_container.find('a', class_ = 'title').text.strip()
review_body = review_container.find('div', class_ = 'text').text.strip()
reviews_titles.append(review_title)
reviews_bodies.append(review_body)
# The name
name = container.h3.a.text
names = [name for i in range(num_reviews)]
# The year
year = container.h3.find('span', class_ = 'lister-item-year').text
years = [year for i in range(num_reviews)]
# The IMDB rating
imdb_rating = float(container.strong.text)
imdb_ratings = [imdb_rating for i in range(num_reviews)]
# The Metascore
metascore = container.find('span', class_ = 'metascore').text
metascores = [metascore for i in range(num_reviews)]
# Gathering up scraped data into result_df
if result_df.empty:
result_df = pd.DataFrame({'movie': names,'year': years,'imdb': imdb_ratings,'metascore': metascores,'review_title': reviews_titles,'review_body': reviews_bodies})
elif num_reviews > 0:
result_df = result_df.append(pd.DataFrame({'movie': names,'year': years,'imdb': imdb_ratings,'metascore': metascores,'review_title': reviews_titles,'review_body': reviews_bodies}))
Btw I'm not sure that IMDB will let you gather data for all films in a loop as is. There's a possibility that you can get a captcha or redirection to some other page. If these issue appears,I'd go with a simple solution - pauses in scraping and/or changing user-agents.
Pause (sleep) can be implemented as follows:
import time
import numpy as np
time.sleep((30-5)*np.random.random()+5) #from 5 to 30 seconds
Inserting a user-agent in request can be done as follows:
import requests
from bs4 import BeautifulSoup
url = ('http://www.link_you_want_to_make_request_on.com/bla_bla')
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
Google some other variants of user-agents, make a list from them and change them from time to time in next requests. Watch out though which user-agents you use - some of them indicate mobile or tablet devices, and for them a site (not only IMDB) can give response pages in a format that differs from PC one - other markup, other design etc. So in general above algorithm works only for PC version of pages.

Related

Filter strings scraped from input form in Python

How do I filter out certain skills like 'django' and 'Django' from a collection of skills provided by users through an input form using a Python function?
I've requests and bs4 to get the raw data, but I need to filter through the results. Here's my code so far:
from bs4 import BeautifulSoup
import requests
import time
unfamiliar_skills = list(map(str,input('>')))
def find_jobs():
html_text = requests.get('https://www.timesjobs.com/candidate/job-search.html?searchType=personalizedSearch&from=submit&txtKeywords=python&txtLocation=').text
soup = BeautifulSoup(html_text, 'lxml')
jobs = soup.find_all('li', class_ = 'clearfix job-bx wht-shd-bx')
jobs = soup.find_all('li', class_ = 'clearfix job-bx wht-shd-bx')
# we first created the parsing for one output, then used for loop to parse multiple instances of that.
for index, job in enumerate(jobs):
published_date = job.find('span', class_ = 'sim-posted').span.text # must be b 1st to prevent scraping if the pub date is not == few days ago
if 'few' in published_date:
company_name = job.find('h3', class_ = 'joblist-comp-name').text.replace(' ','')
skills = job.find('span', class_ = 'srp-skills').text.replace(' ','')
more_info = job.header.h2.a['href'] # like in a dictionary
if filter(unfamiliar_skills, skills):
with open(f'C:/Users/USER/{index}.txt', 'w') as f:
f.write(f'Company Name: {company_name.strip()} \n')
f.write(f'Required Skills: {skills.strip()} \n')
f.write(f'more_info: {more_info} \n')
print(f'File saved: {index}')
if __name__ == '__main__':
while True:
find_jobs()
time_wait = 10
print(f'Waiting {time_wait} minutes...')
time.sleep(time_wait*60)
Here is the printed output of skills variable:
rest,python,database,django,debugging,mongodb
python,webtechnologies,linux,mobile,mysql,angularjs,javascript
rest,python,security,debugging
python,docker,messaging,pythonscripting
python,git,django
python,database,django,mysql,api
python,hadoop,machinelearning
rest,python,django,git
python,django,,framework
python,java,scala
python,linux,windows,sql
python,webdeveloper,webservices
rest,python,database,django,api
Python,Django,Flask
python,django,javascript,webprogramming
python,Django,ObjectRelationalMapper
python,webtechnologies,webtechnologies
python,django,html5,javascript
python,django,html5,javascript
None

web scraping can't get data of all links in page at same time

From someday I am trying to crawl all vessel data from vesselfinder with its description page, like from description page I want its information like vessel type, Imo number etc. in table form. I try different way to do this but still a lot of errors. First, I found that how I go through these links to its description page, how to get all these links from all pages, also how to get specific table data from its description page (which is still not complete but get some).
But today I try get the data from all links with its description pages at same time, it gives me a lot of error which make me so confused (by combining the code).
I attached my code, which is not good but to this point #print(len(vessellist)) it work after that… errors..
import requests
from bs4 import BeautifulSoup
import pandas as pd
headers = {
'user-agent': 'Mozilla/5.0',
'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
}
baseurl = 'https://www.vesselfinder.com/vessels'
vessellist = []
for x in range(1,6):
response = requests.get(
f'https://www.vesselfinder.com/vessels?page={x}',
headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
contents = soup.find_all('td', class_='v2')
for property in contents:
for item in property.find_all('a', href=True):
vessellist.append(baseurl + item['href'])
for link in vessellist:
response = requests.get(link, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
table = soup.find('table', class_ = 'tparams')
head = []
for i in table.find_all('td', class_ = 'n3'):
title = i.text
head.append(title)
values =[]
for row in table.find_all('td', class_ = 'v3'):
data = row.text
values.append(data)
df = pd.DataFrame(values)
print(df)
two steps: get summary data (includes href).Next get detailled ones. Theses two steps are implemented in two functions. Here I get first 10 pages, 200 are available.
import requests as rq
from bs4 import BeautifulSoup as bs
from requests.api import head
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0"}
def getSummaryData():
data = []
url = "https://www.vesselfinder.com/vessels"
for page in range(1, 10+1, 1): # only 200 first pages autorized ?
print("Page : %d/10" % page)
resp = rq.get(url + "?page=%s" % page, headers=headers)
soup = bs(resp.content, "lxml")
section = soup.find_all('section', {'class', 'listing'})[0]
tbody = section.find_all('tbody')[0]
trs = tbody.find_all('tr')
for tr in trs:
tds = tr.find_all('td')
# column 1 data
sub = tds[1].find('a')
href = sub['href']
divs = sub.find_all('div')
country = divs[0]['title']
sub_divs = divs[1].find_all('div')
vessel_name = sub_divs[0].text
vessel_type = sub_divs[1].text
# column 2 data
build_year = tds[2].text
# column 3 data
gt = tds[3].text
# column 4 data
dwt = tds[4].text
# column 5 data
size = tds[5].text
# save data
tr_data = {'country': country,
'vessel_name': vessel_name,
'vessel_type': vessel_type,
'build_year': build_year,
'gt': gt,
'dwt': dwt,
'size': size,
'href': href}
data.append(tr_data)
return data
def getDetailledData(data):
for (iel, el) in enumerate(data):
print("%d/%d" % (iel+1, len(data)))
url = "https://www.vesselfinder.com" + el['href']
# make get call
resp = rq.get(url, headers=headers)
soup = bs(resp.content, "lxml")
# position and voyage data
table = soup.find_all('table', {'class', 'aparams'})[0]
trs = table.find_all('tr')
labels = ["course_speed", "current_draught","navigation_status",
"position_received", "IMO_MMSI", "callsign", "flag", "length_beam"]
for (i, tr) in enumerate(trs):
td = tr.find_all('td')[1]
el.update({'%s' % labels[i]: td.text})
# vessel particulars
table = soup.find_all('table', {'class', 'tparams'})[0]
trs = table.find_all('tr')
labels = ["IMO_number", "vessel_name", "ship_type", "flag",
"homeport", "gross_tonnage", "summer_deadweight_t",
"length_overall_m", "beam_m", "draught_m", "year_of_built",
"builder", "place_of_built", "yard", "TEU", "crude", "grain",
"bale", "classification_society", "registered_owner", "manager"]
for (i, tr) in enumerate(trs):
td = tr.find_all('td')[1]
el.update({'%s' % labels[i]: td.text})
#break
return data
Call theses functions :
data = getSummaryData() # href include
data = getDetailledData(data)
Don't rely on 'class' tag to target the data. Generally, you need to go throught table -> tbody and then get tds or trs to be sure that's the correct ones.

Python - Beautiful Soup: Webscraping PubMed - extracting PMIDs (an article ID), adding to list, and preventing duplicate scraping

I want to extract research abstracts on PubMed. I will have multiple URLs to search for publications and some of them will have the same articles as others. Each article has a unique ID called a PMID. Basically, the abstract of each URL is a substring + the PMID (example: https://pubmed.ncbi.nlm.nih.gov/ + 32663045). However, I don't want to extract the same article twice for multiple reasons (i.e., takes longer to complete the entire code, uses up more bandwidth), so once I extract the PMID, I add it to a list. I'm trying to make my code only extract information from the abstract just once, however my code is still extracting duplicate PMIDs and publication titles.
I know how to get rid of duplicates in Pandas in my output, but that's not what I want to do. I want to basically skip over PMIDs/URLs that I already scraped.
Current Output
Title| PMID
COVID-19 And Racial/Ethnic Disparities In Health Risk | 32663045
The Risk Of Severe COVID-19 | 32941086
COVID-19 And Racial/Ethnic Disparities In Health Risk | 32663045
The Risk Of Severe COVID-19 | 32941086
Desired Output
Title| PMID
COVID-19 And Racial/Ethnic Disparities In Health Risk | 32663045
The Risk Of Severe COVID-19 | 32941086
Here's my code:
from bs4 import BeautifulSoup
import csv
import time
import requests
import pandas as pd
all_pmids = []
out = []
search_urls = ['https://pubmed.ncbi.nlm.nih.gov/?term=%28AHRQ%5BAffiliation%5D%29+AND+%28COVID-19%5BText+Word%5D%29&sort=','https://pubmed.ncbi.nlm.nih.gov/?term=%28AHRQ%5BAffiliation%5D%29+AND+%28COVID-19%5BText+Word%5D%29&sort=']
for search_url in search_urls:
response = requests.get(search_url)
soup = BeautifulSoup(response.content, 'html.parser')
pmids = soup.find_all('span', {'class' : 'docsum-pmid'})
for p in pmids:
p = p.get_text()
all_pmids.append(p) if p not in all_pmids else print('project already in list, skipping')
for pmid in all_pmids:
url = 'https://pubmed.ncbi.nlm.nih.gov/'+pmid
response2 = requests.get(url)
soup2 = BeautifulSoup(response2.content, 'html.parser')
title = soup2.select('h1.heading-title')[0].text.strip()
data = {'title': title, 'pmid': pmid, 'url':url}
time.sleep(3)
out.append(data)
df = pd.DataFrame(out)
df.to_excel('my_results.xlsx')
Just an indentation error, or more accurately, where you are running your two for loops. If it isn't just an overlooked mistake, read the explanation. If it is just a mistake, unindent your second for loop.
Because you are searching all_pmids within your larger search_url loop without resetting it after each search, it finds the first two pmids, adds them to all_pmids, then runs the next loop for those two.
In the second run of the outer loop, it finds the next two pmids, sees they're already in ```all_pmids`` so doesn't add them, but still runs the inner loop on the first two still stored in the list.
You should run the inner loop separately, as such:
from bs4 import BeautifulSoup
import csv
import time
import requests
import pandas as pd
all_pmids = []
out = []
search_urls = ['https://pubmed.ncbi.nlm.nih.gov/?term=%28AHRQ%5BAffiliation%5D%29+AND+%28COVID-19%5BText+Word%5D%29&sort=','https://pubmed.ncbi.nlm.nih.gov/?term=%28AHRQ%5BAffiliation%5D%29+AND+%28COVID-19%5BText+Word%5D%29&sort=']
for search_url in search_urls:
response = requests.get(search_url)
soup = BeautifulSoup(response.content, 'html.parser')
pmids = soup.find_all('span', {'class' : 'docsum-pmid'})
for p in pmids:
p = p.get_text()
all_pmids.append(p) if p not in all_pmids else print('project already in list, skipping')
for pmid in all_pmids:
url = 'https://pubmed.ncbi.nlm.nih.gov/'+pmid
response2 = requests.get(url)
soup2 = BeautifulSoup(response2.content, 'html.parser')
title = soup2.select('h1.heading-title')[0].text.strip()
data = {'title': title, 'pmid': pmid, 'url':url}
time.sleep(3)
out.append(data)
df = pd.DataFrame(out)
df.to_excel('my_results.xlsx')
You should move the for pmid in all_pmids loop outside the for search_url in search_urls loop
...
for search_url in search_urls:
response = requests.get(search_url)
soup = BeautifulSoup(response.content, 'html.parser')
pmids = soup.find_all('span', {'class' : 'docsum-pmid'})
for p in pmids:
p = p.get_text()
all_pmids.append(p) if p not in all_pmids else print('project already in list, skipping')
## move this for loop outside!!
for pmid in all_pmids:
url = 'https://pubmed.ncbi.nlm.nih.gov/'+pmid
response2 = requests.get(url)
soup2 = BeautifulSoup(response2.content, 'html.parser')
...

How do I scrape "description" of movies in the IMDB website using BeautifulSoup?

I am using BeautifulSoup to scrape movies in the IMDB website. I was able to scrape name, genre, duration, rating of movies successfully. But I am not able to scrape description of the movies as when I am looking at the classes, it is "text-muted" and since this class is there multiple times holding other data such as rating, genre, duration. But since these data has inner classes also, so it was easier for me to scrape it but when it is coming to description, it does not have any inner class. So when pulling out data just using "text-muted" is giving other data also. How do I just get the description of the movies?
Attaching the code and screenshot for reference:
The sample code which I used to scrape genre is as follows:
genre_tags=data.select(".text-muted .genre")
genre=[g.get_text() for g in genre_tags]
Genre = [item.strip() for item in genre if str(genre)]
print(Genre)
In general, lxml is much better than beautifulsoup.
import requests
from lxml
import html
url = "xxxx"
r = requests.get(url)
tree = html.fromstring(r.text)
rows = tree.xpath('//div[#class="lister-item mode-detail"]')
for row in rows:
description = row.xpath('.//div[#class="ratings-bar"]/following-sibling::p[#class="text-muted"]/text()')[0].strip()
You can use this, :) , if helped you, UP my solution pls.. thks,
from bs4 import BeautifulSoup
from requests_html import HTMLSession
URL = 'https://www.imdb.com/chart/moviemeter/?ref_=nv_mv_mpm' #url of Most Popular Movies in IMDB
PAGE = HTMLSession().get(URL)
PAGE_BS4 = BeautifulSoup(PAGE.html.html,'html.parser')
MoviesObj = PAGE_BS4.find_all("tbody","lister-list") #get table body of Most Popular Movies
for index in range(len(MoviesObj[0].find_all("td","titleColumn"))):
a = list(MoviesObj[0].find_all("td","titleColumn")[index])[1]
href = 'https://www.imdb.com'+a.get('href') #get each link for movie page
moviepage = HTMLSession().get(href) #request each page of movie
moviepage = BeautifulSoup(moviepage.html.html,'html.parser')
title = list(moviepage.find_all('h1')[0].stripped_strings)[0] #parse title
year = list(moviepage.find_all('h1')[0].stripped_strings)[2] #parse year
try:
score = list(moviepage.find_all('div','ratingValue')[0].stripped_strings)[0] #parse score if is available
except IndexError:
score = '-' #if score is not available '-' is filled
description = list(moviepage.find_all('div','summary_text')[0].stripped_strings)[0] #parse description
print(f'TITLE: {title} YEAR: {year} SCORE: {score}\nDESCRIPTION:{description}\n')
PRINT
Junior Saldanha
#UmSaldanha

Website scraping with python 2.7 and beautifulsoup 4

I have stuck at one point while scraping website "http://www.queensbronxba.com/directory/" with beautifulsoup. I'm almost done with scraping and I left only company name from the list which is found in paragraph tag. The problem is that there are more paragraph tags in the same div but I only need the first one as it gives the company name. So I need first paragraph on following div's also not just at first one. This is the code I used to srcape:
page = requests.get("http://www.queensbronxba.com/directory/")
soup = BeautifulSoup(page.content, 'html.parser')
company = soup.find(class_="boardMemberWrap")
contact = company.find_all(class_="boardMember")
info = contact[0]
print(info.prettify())
name_tags = company.select("h4")
names = [nt.get_text() for nt in company_tags]
names
company_tags = company.select("p") #here I need help to get only first paragraphs of following div containers
companies = [ct.get_text() for ct in company_tags]
companies
phone_tags = company.select('a[href^="tel"]')
phones = [pt.get_text() for pt in phone_tags]
phones
email_tags = company.select('a[href^="mailto"]')
emails = [et.get_text() for et in email_tags]
emails
import requests
from bs4 import BeautifulSoup
page = requests.get("http://www.queensbronxba.com/directory/")
soup = BeautifulSoup(page.content, 'html.parser')
company = soup.find(class_="boardMemberWrap")
contact = company.findAll(class_="boardMemberInfo")
info = contact[0]
print(info.prettify())
name_tags = company.select("h4")
names = [nt.get_text() for nt in name_tags]
print(names)
for name in company.findAll(class_="boardMember"):
for n in name.findAll('p')[:1]:
print(n.text)
phone_tags = company.select('a[href^="tel"]')
phones = [pt.get_text() for pt in phone_tags]
print(phones)
email_tags = company.select('a[href^="mailto"]')
emails = [et.get_text() for et in email_tags]
print(emails)

Categories