Scraping a certain range from a table at a webpage - python

I'm trying to scrape data from this website, which has a table of game credits for different categories. There are 24 categories in total that I want to turn into 24 columns. The example webpage has five of them, including production, design, engineering, and thanks.
It would have been easy if they each had a different class, but they all share the same h3 class, "clean". Different pages have different categories, and the order changes from page to page too. On top of that, the information I need is actually in the next row of the table, in a different class.
So what I figured is that if I write 24 if statements, one per category, to check whether an h3 with class "clean" contains that category, I can scrape the class I need and otherwise put 'none'. But the problem is that they all share the same class. So I think I can try to use td colspan="5" as a marker to let Python know where each category starts and ends.
My question is: is there a way to scrape until it encounters td colspan="5" and then stop?
import bs4 as bs
import urllib.request

gameurl = "https://www.mobygames.com/developer/sheet/view/developerId,1"
req = urllib.request.Request(gameurl, headers={'User-Agent': 'Mozilla/5.0'})
sauce = urllib.request.urlopen(req).read()
soup = bs.BeautifulSoup(sauce, 'lxml')
infopage = soup.find_all("div", {"class": "col-md-8 col-lg-8"})
core_list = []
for credits in infopage:
    niceHeaderTitle = credits.find_all("h1", {"class": "niceHeaderTitle"})
    name = niceHeaderTitle[0].text
    Titles = credits.find_all("h3", {"class": "clean"})
    Titles = [title.get_text() for title in Titles]
    # note: every branch below currently grabs the same first devCreditsHighlight row
    if 'Business' in Titles:
        businessinfo = credits.find_all("tr", {"class": "devCreditsHighlight"})
        business = businessinfo[0].get_text(strip=True)
    else:
        business = 'none'
    if 'Production' in Titles:
        productioninfo = credits.find_all("tr", {"class": "devCreditsHighlight"})
        production = productioninfo[0].get_text(strip=True)
    else:
        production = 'none'
    if 'Design' in Titles:
        designinfo = credits.find_all("tr", {"class": "devCreditsHighlight"})
        design = designinfo[0].get_text(strip=True)
    else:
        design = 'none'
    if 'Writers' in Titles:
        writersinfo = credits.find_all("tr", {"class": "devCreditsHighlight"})
        writers = writersinfo[0].get_text(strip=True)
    else:
        writers = 'none'
    if 'Programming/Engineering' in Titles:
        programinfo = credits.find_all("tr", {"class": "devCreditsHighlight"})
        program = programinfo[0].get_text(strip=True)
    else:
        program = 'none'
    if 'Video/Cinematics' in Titles:
        videoinfo = credits.find_all("tr", {"class": "devCreditsHighlight"})
        video = videoinfo[0].get_text(strip=True)
    else:
        video = 'none'
    if 'Audio' in Titles:
        audioinfo = credits.find_all("tr", {"class": "devCreditsHighlight"})
        audio = audioinfo[0].get_text(strip=True)
    else:
        audio = 'none'
    if 'Art/Graphics' in Titles:
        artinfo = credits.find_all("tr", {"class": "devCreditsHighlight"})
        art = artinfo[0].get_text(strip=True)
    else:
        art = 'none'
    if 'Support' in Titles:
        supportinfo = credits.find_all("tr", {"class": "devCreditsHighlight"})
        support = supportinfo[0].get_text(strip=True)
    else:
        support = 'none'
    if 'Thanks' in Titles:
        thanksinfo = credits.find_all("tr", {"class": "devCreditsHighlight"})
        thanks = thanksinfo[0].get_text(strip=True)
    else:
        thanks = 'none'
    games = [name, business, production, design, writers, video, audio, art, support, program, thanks]
    core_list.append(games)
print(core_list)
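A rough sketch of the colspan marker idea is below. It assumes (this is not verified against the live page) that each category's h3 sits inside a table row and that a td with colspan="5" separates one category from the next; if the markup differs, the walk over sibling rows has to be adjusted.
import bs4 as bs
import urllib.request

gameurl = "https://www.mobygames.com/developer/sheet/view/developerId,1"
req = urllib.request.Request(gameurl, headers={'User-Agent': 'Mozilla/5.0'})
soup = bs.BeautifulSoup(urllib.request.urlopen(req).read(), 'lxml')

categories = {}
for header in soup.find_all("h3", {"class": "clean"}):
    category = header.get_text(strip=True)
    collected = []
    row = header.find_parent("tr")  # assumed: the h3 lives inside a table row
    while row is not None:
        row = row.find_next_sibling("tr")
        if row is None or row.find("td", attrs={"colspan": "5"}) is not None:
            break  # the separator row (or end of table) closes this category
        if row.find("h3", {"class": "clean"}) is not None:
            break  # safety net in case the next category header appears first
        collected.append(row.get_text(strip=True))
    categories[category] = collected or ['none']
print(categories)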

Related

Filter strings scraped from input form in Python

How do I filter out certain skills like 'django' and 'Django' from a collection of skills provided by users through an input form using a Python function?
I've used requests and bs4 to get the raw data, but I need to filter through the results. Here's my code so far:
from bs4 import BeautifulSoup
import requests
import time
unfamiliar_skills = list(map(str,input('>')))
def find_jobs():
    html_text = requests.get('https://www.timesjobs.com/candidate/job-search.html?searchType=personalizedSearch&from=submit&txtKeywords=python&txtLocation=').text
    soup = BeautifulSoup(html_text, 'lxml')
    jobs = soup.find_all('li', class_ = 'clearfix job-bx wht-shd-bx')
    # we first created the parsing for one output, then used a for loop to parse multiple instances of it
    for index, job in enumerate(jobs):
        published_date = job.find('span', class_ = 'sim-posted').span.text  # must come first so we skip jobs whose publication date is not "few days ago"
        if 'few' in published_date:
            company_name = job.find('h3', class_ = 'joblist-comp-name').text.replace(' ', '')
            skills = job.find('span', class_ = 'srp-skills').text.replace(' ', '')
            more_info = job.header.h2.a['href']  # like in a dictionary
            if filter(unfamiliar_skills, skills):
                with open(f'C:/Users/USER/{index}.txt', 'w') as f:
                    f.write(f'Company Name: {company_name.strip()} \n')
                    f.write(f'Required Skills: {skills.strip()} \n')
                    f.write(f'more_info: {more_info} \n')
                print(f'File saved: {index}')

if __name__ == '__main__':
    while True:
        find_jobs()
        time_wait = 10
        print(f'Waiting {time_wait} minutes...')
        time.sleep(time_wait * 60)
Here is the printed output of the skills variable:
rest,python,database,django,debugging,mongodb
python,webtechnologies,linux,mobile,mysql,angularjs,javascript
rest,python,security,debugging
python,docker,messaging,pythonscripting
python,git,django
python,database,django,mysql,api
python,hadoop,machinelearning
rest,python,django,git
python,django,,framework
python,java,scala
python,linux,windows,sql
python,webdeveloper,webservices
rest,python,database,django,api
Python,Django,Flask
python,django,javascript,webprogramming
python,Django,ObjectRelationalMapper
python,webtechnologies,webtechnologies
python,django,html5,javascript
python,django,html5,javascript
None
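One possible way to do the filtering (a sketch, not taken from the post above): the input is assumed to be a comma-separated list such as django,Django, and skills is assumed to be a comma-separated string like the output shown.
unfamiliar_skills = [s.strip().lower() for s in input('> ').split(',') if s.strip()]

def has_unfamiliar(skills):
    # skills is a comma-separated string like "rest,python,database,django"
    job_skills = {s.strip().lower() for s in skills.split(',') if s.strip()}
    return any(skill in job_skills for skill in unfamiliar_skills)

# Inside find_jobs(), write the file only when the job lists none of the
# unfamiliar skills:
# if not has_unfamiliar(skills):
#     with open(f'C:/Users/USER/{index}.txt', 'w') as f:
#         ...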

Web Crawler Looping the URL to crawl many pages

I am lost with making a loop to go through all of the pages on this book site. The URL ends in 'all?page=' followed by the page number, so I thought it would be easy, but I'm stuck. All the info gathering works fine; I just don't know how to move to the next pages. Any help would be appreciated.
import requests
from bs4 import BeautifulSoup

URL = 'https://www.bookdepository.com/category/352/Science-Fiction/browse/viewmode/all?page=' + str(page)
page = 1
page += 1
for page in max_pages:
    html = requests.get(URL)
    soup = BeautifulSoup(html.content, "html.parser")
    # ^This part I need help with^
    # results = all books present on page
    # books = each individual book on the page
    results = soup.find(class_='tab search')
    books = results.find_all('div', class_='book-item')
    for book in books:
        title = book.h3.a
        author = book.p.span
        # in case there is no rating on a book
        if len(book.find('div', 'rating-wrap').findAll('span', 'full-star')) == None:
            pass
        else:
            rating = len(book.find('div', 'rating-wrap').findAll('span', 'full-star'))
        publish_date = book.find(class_='published')
        format = book.find(class_='format')
        price = book.find('span', class_='sale-price').text.strip()
        # if there is no discount
        if book.find(class_='rrp') == None:
            pass
        else:
            original_price = book.find(class_='rrp').text.strip()
        if book.find(class_='price-save') == None:
            pass
        else:
            discount = book.find(class_='price-save').text.strip()
        # unneeded text removed such as 'US' before the price shown
        price = price.replace('US', '')
        original_price = original_price.replace('US', '')
        discount = discount.replace('Save US', '')
        # .text.strip() gets text and rids of empty spaces
        print(title.text.strip())
        print(author.text.strip())
        print(rating, 'stars')
        print(publish_date.text.strip())
        print(format.text.strip())
        print(price)
        print(original_price)
        print(discount, 'in savings!')
What this code does is loop 5 times (in this case), with page going up by one every single time.
max_pages = 5
for page in range(max_pages):
    URL = f"https://www.bookdepository.com/category/352/Science-Fiction/browse/viewmode/all?page={page}"
    html = requests.get(URL)
    soup = BeautifulSoup(html.content, "html.parser")
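For completeness, here is a sketch of how that loop could wrap the parsing code from the question (the selectors are copied from the question and assumed to still match the site; only title and price are shown, and page numbering starting at 1 is an assumption):
import requests
from bs4 import BeautifulSoup

max_pages = 5
for page in range(1, max_pages + 1):
    URL = f"https://www.bookdepository.com/category/352/Science-Fiction/browse/viewmode/all?page={page}"
    soup = BeautifulSoup(requests.get(URL).content, "html.parser")
    results = soup.find(class_='tab search')
    if results is None:
        break  # no result container, so we have probably run past the last page
    for book in results.find_all('div', class_='book-item'):
        title = book.h3.a
        price = book.find('span', class_='sale-price')
        print(title.text.strip(), price.text.strip() if price else 'no price')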

BeautifulSoup (Python): how to grab the text string next to a tag (that may or may not exist)?

I think my title explains the problem I am facing pretty well. Let's look at a picture of the problem. (You can find the web page at this address; however, it has probably changed.)
I have highlighted the text that I want to grab in blue; this is the model year, 2008. Now, it is not necessary for the seller to submit the model year, so this may or may not exist. But when it does exist, it always follows the <i> tag with class="fa fa-calendar". My solution so far has been to grab all the text within <p class="result-details"> ... </p> (this then becomes a list) and then choose the second element, conditional on <i class="fa fa-calendar"> ... </i> existing. Otherwise I do not grab anything.
Now, it seems this does not work in general, since the text that comes before the second element can be split into more than one element if it has whitespace in it. So, is there any way (any function) to grab a text string that neighbours another tag, as seen in my picture?
PS: if I have made myself unclear, I just want to fetch the year 2008 from the post on the web page if it exists.
Edit
In this situation my code erroneously gives me the word "Hjulvältar" ("wheeled rollers" in English) instead of the year 2008.
CODE
from bs4 import BeautifulSoup
from datetime import date
import requests

url_avvikande = ['bomliftar','teleskop-bomliftar','kompakta-sjalvgaende-bomlyftar','bandschaktare','reachstackers','staplare']
today = date.today().isoformat()
url_main = 'https://www.mascus.se'
produktgrupper = ['lantbruksmaskiner','transportfordon','skogsmaskiner','entreprenadmaskiner','materialhantering','gronytemaskiner']
kategorier = {
    'lantbruksmaskiner': ['traktorer','sjalvgaende-falthackar','skordetroskor','atv','utv:er','snoskotrar'],
    'transportfordon': ['fordonstruckar','elektriska-fordon','terrangfordon'],
    'skogsmaskiner': ['skog-skordare','skog-gravmaskiner','skotare','drivare','fallare-laggare','skogstraktorer','lunnare','terminal-lastare'],
    'entreprenadmaskiner': ['gravlastare','bandgravare','minigravare-7t','hjulgravare','midigravmaskiner-7t-12t','atervinningshanterare','amfibiska-gravmaskiner','gravmaskiner-med-frontskopa','gravmaskiner-med-lang-rackvidd','gravmaskiner-med-slapskopa','rivningsgravare','specialgravmaskiner','hjullastare','kompaktlastare','minilastmaskiner','bandlastare','teleskopiska-hjullastare','redaskapshallare','gruvlastare','truckar-och-lastare-for-gruvor','bergborriggar','teleskoplastare','dumprar','minidumprar','gruvtruckar','banddumprar','specialiserade-dragare','vaghyvlar','vattentankbilar','allterrangkranar','terrangkranar-grov-terrang','-bandgaende-kranar','saxliftar','bomliftar','teleskop-bomliftar','personhissar-och-andra-hissar','kompakta-sjalvgaende-bomlyftar','krossar','mobila-krossar','sorteringsverk','mobila-sorteringsverk','bandschaktare','asfaltslaggningsmaskiner','--asfaltskallfrasmaskiner','tvavalsvaltar','envalsvaltar','jordkompaktorer','pneumatiska-hjulvaltar','andra-valtar','kombirullar','borrutrustning-ytborrning','horisontella-borrutrustning','trenchers-skar-gravmaskin'],
    'materialhantering': ['dieseltruckar','eldrivna-gaffeltruckar','lpg-truckar','gaffeltruckar---ovriga','skjutstativtruck','sidlastare','teleskopbomtruckar','terminaltraktorer','reachstackers','ovriga-materialhantering-maskiner','staplare-led','staplare','plocktruck-laglyftande','plocktruck-hoglyftande','plocktruck-mediumlyftande','dragtruck','terrangtruck','4-vagstruck','smalgangstruck','skurborsttorkar','inomhus-sopmaskiner','kombinationsskurborstar'],
    'gronytemaskiner': ['kompakttraktorer','akgrasklippare','robotgrasklippare','nollsvangare','plattformsklippare','sopmaskiner','verktygsfraktare','redskapsbarare','golfbilar','fairway-grasklippare','green-grasklippare','grasmattevaltar','ovriga-gronytemaskiner']
}
url = 'https://www.mascus.se'
mappar = ['Lantbruk', 'Transportfordon', 'Skogsmaskiner', 'Entreprenad', 'Materialhantering', 'Grönytemaskiner']
index = -1
status = True
for produktgrupp in kategorier:
    index += 1
    mapp = mappar[index]
    save_path = f'/home/protector.local/vika99/webscrape_mascus/Annonser/{mapp}'
    underkategorier = kategorier[produktgrupp]
    for underkategori in underkategorier:
        # NOTE
        if underkategori != 'borrutrustning-ytborrning' and status:
            continue
        else:
            status = False
        # NOTE
        if underkategori in url_avvikande:
            url = f'{url_main}/{produktgrupp}/{underkategori}'
        elif underkategori == 'gravmaskiner-med-frontskopa':
            url = f'{url_main}/{produktgrupp}/begagnat-{underkategori}'
        elif underkategori == 'borrutrustning-ytborrning':
            url = f'{url_main}/{produktgrupp}/begagnad-{underkategori}'
        else:
            url = f'{url_main}/{produktgrupp}/begagnade-{underkategori}'
        file_name = f'{save_path}/{produktgrupp}_{underkategori}_{today}.txt'
        sida = 1
        print(url)
        with open(file_name, 'w') as f:
            while True:
                print(sida)
                html_text = None
                soup = None
                links = None
                while links == None:
                    html_text = requests.get(url).text
                    soup = BeautifulSoup(html_text, 'lxml')
                    links = soup.find('ul', class_ = 'page-numbers')
                annonser = soup.find_all('li', class_ = 'col-row single-result')
                for annons in annonser:
                    modell = annons.find('a', class_ = 'title-font').text
                    if annons.p.find('i', class_ = 'fa fa-calendar') != None:
                        tillverkningsar = annons.find('p', class_ = 'result-details').text.strip().split(" ")[1]
                    else:
                        tillverkningsar = 'Ej angiven'
                    try:
                        pris = annons.find('span', class_ = 'title-font no-ws-wrap').text
                    except AttributeError:
                        pris = annons.find('span', class_ = 'title-font no-price').text
                    f.write(f'{produktgrupp:<21}{underkategori:25}{modell:<70}{tillverkningsar:<13}{pris:>14}\n')
                url_part = None
                sida += 1
                try:
                    url_part = links.find('a', text = f'{sida}')['href']
                except TypeError:
                    print(f'Avläsning av underkategori klar.')
                    break
                url = f'{url_main}{url_part}'
As you loop over the listings you can test whether that calendar icon class is present; if it is, grab the next_sibling:
import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://www.mascus.se/entreprenadmaskiner/begagnade-pneumatiska-hjulvaltar')
soup = bs(r.content, 'lxml')
listings = soup.select('.single-result')
for listing in listings:
    calendar = listing.select_one('.fa-calendar')
    if calendar is not None:
        print(calendar.next_sibling)
    else:
        print('Not present')
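For reference, a sketch of how that check might slot into the loop in the question, replacing the split(" ")[1] approach (annons comes from the question's code; stripping whitespace from next_sibling is an assumption about how the year text sits next to the icon):
# inside the question's `for annons in annonser:` loop
kalender = annons.select_one('.fa-calendar')
if kalender is not None and kalender.next_sibling is not None:
    tillverkningsar = str(kalender.next_sibling).strip()
else:
    tillverkningsar = 'Ej angiven'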

Dynamic Web Scraping with Selenium

I was trying to scrape data from Amazon using Selenium and Beautiful Soup.
I have scraped and obtained data from the first page, defined a function for it, and managed to get the second page opened with the click() method.
The soup objects used on the first page are similar to the objects on the second page. I am planning to scrape data up to page 6.
I was wondering if I could apply the function defined for the first page to the next 5 pages and append the data, which can later be exported as a CSV.
Any suggestions regarding this would be appreciated.
def data_collection():
    title = soup.find_all(name = "span", class_ = "a-size-base-plus a-color-base a-text-normal")
    all_specs = [specs.getText() for specs in title]
    brands = [items.split(' ', 1)[0] for items in all_specs] #Brand
    phones = [text.split(')')[0].split('(') for text in all_specs]
    spec = []
    for i in phones:
        for j in i:
            spec.append(j)
    model = spec[::2] #Model
    specifications = spec[1::2] #Specs
    s_price_obj = soup.find_all(name = "span", class_ = "a-price-whole")
    selling_price = [price.getText() for price in s_price_obj] #Price
    review_obj = soup.find_all(name = "span", class_ = "a-icon-alt")
    review = [ratings.getText() for ratings in review_obj]
    review = review[:24] #Ratings
    quantity_obj = soup.find_all(name = "span", class_ = "a-size-base")
    quantity_sold = [items.getText() for items in quantity_obj]
    quantity_sold = quantity_sold[:24] #Quantity Sold
    page_number = ['1']*24 #Page Number
    Date = date.today()
    Date = [str(Date)]*24 #Date
    data = [brands, model, specifications, selling_price, review,
            quantity_sold, page_number, Date]
    return data
The above is the function I defined. Open to suggestions.
You can try the following:
Redefine your data_collection method to accept the page source parsed by BeautifulSoup:
def data_collection(soup):
    title = soup.find_all(name = "span", class_ = "a-size-base-plus a-color-base a-text-normal")
    all_specs = [specs.getText() for specs in title]
    brands = [items.split(' ', 1)[0] for items in all_specs] #Brand
    phones = [text.split(')')[0].split('(') for text in all_specs]
    spec = []
    for i in phones:
        for j in i:
            spec.append(j)
    model = spec[::2] #Model
    specifications = spec[1::2] #Specs
    s_price_obj = soup.find_all(name = "span", class_ = "a-price-whole")
    selling_price = [price.getText() for price in s_price_obj] #Price
    review_obj = soup.find_all(name = "span", class_ = "a-icon-alt")
    review = [ratings.getText() for ratings in review_obj]
    review = review[:24] #Ratings
    quantity_obj = soup.find_all(name = "span", class_ = "a-size-base")
    quantity_sold = [items.getText() for items in quantity_obj]
    quantity_sold = quantity_sold[:24] #Quantity Sold
    page_number = ['1']*24 #Page Number
    Date = date.today()
    Date = [str(Date)]*24 #Date
    data = [brands, model, specifications, selling_price, review,
            quantity_sold, page_number, Date]
    return data
Then loop through each page, get the page source, parse it using BeautifulSoup, and pass it to the data_collection function. Example:
# from page (1..6)
for i in range(1, 7):
    # change page=i in the url to iterate through the pages
    url = f'https://www.amazon.in/s?k=mobile+phones&page={i}&qid=1632394501&ref=sr_pg_2'
    driver.get(url)
    # get current page source
    page_source = driver.page_source
    soup = BeautifulSoup(page_source, 'lxml')
    # call data_collection function
    data = data_collection(soup)
    # code to append data to csv
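One way the trailing "# code to append data to csv" placeholder might be filled in (a sketch rather than the answerer's code; the column names are assumptions that mirror the order returned by data_collection, and the filename is made up):
import pandas as pd

# continuation of the loop body above, after data = data_collection(soup);
# zip() truncates to the shortest list, so ragged scrapes don't raise
columns = ['brand', 'model', 'specifications', 'selling_price',
           'review', 'quantity_sold', 'page_number', 'date']
page_df = pd.DataFrame(list(zip(*data)), columns=columns)
# append every page to one CSV, writing the header only on the first page
page_df.to_csv('amazon_phones.csv', mode='a', index=False, header=(i == 1))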

How to webscrape reviews from external links with bs4?

I would like to extract at least 20 user reviews for each movie, but I don't know how to loop into the IMDb title page and then to the user reviews with BeautifulSoup.
start link = "https://www.imdb.com/search/title/?title_type=feature,tv_movie&release_date=2018-01-01,2019-12-31&count=250";
title_link(1) = "https://www.imdb.com/title/tt7131622/?ref_=adv_li_tt";
user_reviews_link_movie1 = "https://www.imdb.com/title/tt7131622/reviews?ref_=tt_ov_rt" ;
I am able to extract the titles, years, ratings and metascores of each movie in the list from a static page.
# Import packages and set urls
from requests import get
url = 'https://www.imdb.com/search/title/?title_type=feature,tv_movie&release_date=2018-01-01,2019-12-31&count=250'
response = get(url)
print(response.text[:500])

from bs4 import BeautifulSoup
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)
movie_containers = html_soup.find_all('div', class_ = 'lister-item mode-advanced')
print(type(movie_containers))
print(len(movie_containers))

# Lists to store the scraped data in
names = []
years = []
imdb_ratings = []
metascores = []
votes = []

# Extract data from individual movie container
for container in movie_containers:
    # If the movie has Metascore, then extract:
    if container.find('div', class_ = 'ratings-metascore') is not None:
        # The name
        name = container.h3.a.text
        names.append(name)
        # The year
        year = container.h3.find('span', class_ = 'lister-item-year').text
        years.append(year)
        # The IMDB rating
        imdb = float(container.strong.text)
        imdb_ratings.append(imdb)
        # The Metascore
        m_score = container.find('span', class_ = 'metascore').text
        metascores.append(int(m_score))

import pandas as pd
test_df = pd.DataFrame({'movie': names, 'year': years, 'imdb': imdb_ratings, 'metascore': metascores})
test_df
Actual results :
movie year imdb metascore
Once Upon a Time... in Hollywood (2019) (8.1) (83)
Scary Stories (2019) (6.5) (61)
Fast & Furious: Hobbs & Shaw (2019) (6.8) (60)
Avengers: Endgame (2019) (8.6) (78)
Expected :
movie1 year1 imbd1 metascore1 review1
movie1 year1 imbd1 metascore1 review2
...
movie1 year1 imbd1 metascore1 review20
movie2 year2 imbd2 metascore2 review1
...
movie2 year2 imbd2 metascore2 review20
...
movie250 year250 imbd250 metascore250 review20
Assuming that the answer to my question in the comments is "yes":
Below is a solution to your initial request.
There's a check of whether a particular film really has 20 reviews; if it has fewer, all available ones are gathered.
Technically the parsing process is correct; I checked it with movie_containers = movie_containers[:3] assigned. Gathering all the data will take some time.
UPDATE: I just finished collecting info on all 250 films. Everything was scraped without errors, so the block after the solution itself is just FYI.
Also, if you want to go further with your parsing, i.e. collect data for the next 250 films and so on, you can add one more looping level to this parser (see the sketch after the code below). The process is similar to the one in the "Reviews extracting" section.
# Import packages and set urls
from requests import get
from bs4 import BeautifulSoup
import pandas as pd

base_url = 'https://www.imdb.com/search/title/?title_type=feature,tv_movie&release_date=2018-01-01,2019-12-31&count=250'
url_header_for_reviews = 'https://www.imdb.com'
url_tail_for_reviews = 'reviews?ref_=tt_urv'
base_response = get(base_url)
html_soup = BeautifulSoup(base_response.text, 'html.parser')
movie_containers = html_soup.find_all('div', class_ = 'lister-item mode-advanced')
result_df = pd.DataFrame()

# Extract data from individual movie container
for container in movie_containers:
    # If the movie has Metascore, then extract:
    if container.find('div', class_ = 'ratings-metascore') is not None:
        # Reviews extracting
        num_reviews = 20
        # Getting last piece of link puzzle for a movie reviews' link
        url_middle_for_reviews = container.find('a')['href']
        # Opening reviews page of a concrete movie
        response_reviews = get(url_header_for_reviews + url_middle_for_reviews + url_tail_for_reviews)
        reviews_soup = BeautifulSoup(response_reviews.text, 'html.parser')
        # Searching all reviews
        reviews_containers = reviews_soup.find_all('div', class_ = 'imdb-user-review')
        # Check if actual number of reviews is less than target one
        if len(reviews_containers) < num_reviews:
            num_reviews = len(reviews_containers)
        # Looping through each review and extracting title and body
        reviews_titles = []
        reviews_bodies = []
        for review_index in range(num_reviews):
            review_container = reviews_containers[review_index]
            review_title = review_container.find('a', class_ = 'title').text.strip()
            review_body = review_container.find('div', class_ = 'text').text.strip()
            reviews_titles.append(review_title)
            reviews_bodies.append(review_body)
        # The name
        name = container.h3.a.text
        names = [name for i in range(num_reviews)]
        # The year
        year = container.h3.find('span', class_ = 'lister-item-year').text
        years = [year for i in range(num_reviews)]
        # The IMDB rating
        imdb_rating = float(container.strong.text)
        imdb_ratings = [imdb_rating for i in range(num_reviews)]
        # The Metascore
        metascore = container.find('span', class_ = 'metascore').text
        metascores = [metascore for i in range(num_reviews)]
        # Gathering up scraped data into result_df
        if result_df.empty:
            result_df = pd.DataFrame({'movie': names, 'year': years, 'imdb': imdb_ratings, 'metascore': metascores, 'review_title': reviews_titles, 'review_body': reviews_bodies})
        elif num_reviews > 0:
            result_df = result_df.append(pd.DataFrame({'movie': names, 'year': years, 'imdb': imdb_ratings, 'metascore': metascores, 'review_title': reviews_titles, 'review_body': reviews_bodies}))
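A rough sketch of that extra looping level (treating the start parameter, stepping by 250 per results page, as an assumption about the IMDb search URL rather than a verified fact):
# hypothetical outer loop over search-result pages, 250 films per page
for start in range(1, 1001, 250):
    base_response = get(f'{base_url}&start={start}')
    html_soup = BeautifulSoup(base_response.text, 'html.parser')
    movie_containers = html_soup.find_all('div', class_ = 'lister-item mode-advanced')
    # ...then run the same per-container extraction as in the solution above...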
By the way, I'm not sure that IMDb will let you gather data for all films in a loop as is. There's a possibility that you'll get a captcha or a redirection to some other page. If these issues appear, I'd go with a simple solution: pauses in scraping and/or changing user agents.
Pause (sleep) can be implemented as follows:
import time
import numpy as np
time.sleep((30-5)*np.random.random()+5) #from 5 to 30 seconds
Inserting a user-agent in request can be done as follows:
import requests
from bs4 import BeautifulSoup
url = ('http://www.link_you_want_to_make_request_on.com/bla_bla')
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
Google some other variants of user agents, make a list of them, and switch between them from time to time in subsequent requests. Watch out, though, for which user agents you use: some of them indicate mobile or tablet devices, and for those a site (not only IMDb) can return response pages in a format that differs from the PC one (other markup, other design, etc.). So in general the above algorithm works only for the PC version of pages.
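A minimal sketch of that rotation, assuming you have collected a few desktop user-agent strings yourself (the two below are just examples):
import random
import requests
from bs4 import BeautifulSoup

USER_AGENTS = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
]

def polite_get(url):
    # pick a different desktop user agent for each request
    headers = {'User-Agent': random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers)

# usage: soup = BeautifulSoup(polite_get(base_url).text, 'html.parser')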
