I'm currently using this code to web scrape reviews from TrustPilot. I wish to adjust the code to scrape reviews from (https://boxes.mysubscriptionaddiction.com/box/boxycharm?ratings=true#review-update-create). However, unlike most other review sites, the reviews are not separated into multiple sub-pages but there is instead a button at the end of the page to "view more reviews" which shows 3 additional reviews whenever you press it.
Is it possible to adjust the code such that it is able to scrape all the reviews from this particular product within the website with this kind of web structure?
from bs4 import BeautifulSoup
import requests
import pandas as pd
import json

print('all imported successfully')

# Initialize an empty dataframe
df = pd.DataFrame()
for x in range(1, 44):
    names = []
    headers = []
    bodies = []
    ratings = []
    published = []
    updated = []
    reported = []

    link = f'https://www.trustpilot.com/review/birchbox.com?page={x}'
    print(link)
    req = requests.get(link)
    content = req.content
    soup = BeautifulSoup(content, "lxml")
    articles = soup.find_all('article', {'class': 'review'})
    for article in articles:
        names.append(article.find('div', attrs={'class': 'consumer-information__name'}).text.strip())
        headers.append(article.find('h2', attrs={'class': 'review-content__title'}).text.strip())
        try:
            bodies.append(article.find('p', attrs={'class': 'review-content__text'}).text.strip())
        except:
            bodies.append('')
        try:
            # ratings.append(article.find('div', attrs={'class': 'star-rating star-rating--medium'}).text.strip())
            # ratings.append(article.find('div', attrs={'class': 'star-rating star-rating--medium'})['alt'])
            ratings.append(article.find_all("img", alt=True)[0]["alt"])
        except:
            ratings.append('')
        dateElements = article.find('div', attrs={'class': 'review-content-header__dates'}).text.strip()
        jsonData = json.loads(dateElements)
        published.append(jsonData['publishedDate'])
        updated.append(jsonData['updatedDate'])
        reported.append(jsonData['reportedDate'])

    # Create a temporary dataframe for the current page, then append it to the "final" dataframe
    temp_df = pd.DataFrame({'User Name': names, 'Header': headers, 'Body': bodies, 'Rating': ratings, 'Published Date': published, 'Updated Date': updated, 'Reported Date': reported})
    df = df.append(temp_df, sort=False).reset_index(drop=True)
    print('pass1')

df.to_csv('BirchboxReviews2.0.csv', index=False, encoding='utf-8')
print('excel done')
Basically you are dealing with a website whose content is loaded dynamically via JavaScript once the page loads; the comments are rendered with JS code each time more reviews are requested.
I've been able to locate the XHR request that fetches the comments and to call it directly, which retrieves all the comments you asked for.
You don't need to use Selenium, as it will slow down your task.
Here is how you can achieve your target. Each page of the XHR endpoint includes 3 comments, so we just do the math to cover all the pages.
import requests
from bs4 import BeautifulSoup
import math


def PageNum():
    r = requests.get(
        "https://boxes.mysubscriptionaddiction.com/box/boxycharm?ratings=true#review-update-create")
    soup = BeautifulSoup(r.text, 'html.parser')
    num = int(
        soup.find("a", class_="show-more-reviews").text.split(" ")[3][1:-1])
    if num % 3 == 0:
        return (num // 3) + 1
    else:
        return math.ceil(num / 3) + 2


def Main():
    num = PageNum()
    headers = {
        'X-Requested-With': 'XMLHttpRequest'
    }
    with requests.Session() as req:
        for item in range(1, num):
            print(f"Extracting Page# {item}")
            r = req.get(
                f"https://boxes.mysubscriptionaddiction.com/get_user_reviews?box_id=105&page={item}", headers=headers)
            soup = BeautifulSoup(r.text, 'html.parser')
            for com in soup.findAll("div", class_=r'\"comment-body\"'):
                print(com.text[5:com.text.find(r"\n", 3)])


Main()
Sample of the output:
Number of Pages 49
Extracting Page# 1
****************************************
I think Boxycharm overall is the best beauty subscription. However, I think it's
ridiculous that if you want to upgrade you have to pay the 25 for the first box and then add additional money to get the premium. Even though it's only one time,
that's insane. So about 80 bucks just to switch to Premium. And suppose U do that and then my Boxy Premium shows up at my door. I open it ....and absolutely hate
the majority if everything I have. Yeah I would be furious! Not worth taking a chance on. Boxy only shows up half the time with actual products or colors I use.
I love getting the monthly boxes, just wish they would have followed my preferences for colors!
I used to really get excited for my boxes. But not so much anymore. This months
Fenty box choices lack! I am not a clown
Extracting Page# 2
****************************************
Love it its awsome
Boxycharm has always been a favorite subscription box, I’ve had it off and on , love most of the goodies. I get frustrated when they don’t curate it to fit me and or customer service isn’t that helpful but overall a great box’!
I like BoxyCharm but to be honest I feel like some months they don’t even look at your beauty profile because I sometimes get things I clearly said I wasn’t interested in getting.
Extracting Page# 3
****************************************
The BEST sub box hands down.
I love all the boxy charm boxes everything is amazing all full size products and
the colors are outstanding
I absolutely love Boxycharm. I have received amazing high end products. My makeup cart is so full I have such a variety everyday. I love the new premium box and paired with Boxyluxe I recieve 15 products for $85 The products are worth anywhere from $500 to $700 total. I used to spend $400 a month buying products at Ulta. I would HIGHLY recommend this subscription.
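If you also want the reviewer names (and not just the comment bodies) from the same XHR endpoint, here is a hedged sketch. It assumes the XHR payload comes back as escaped HTML (which is what the class_=r'\"comment-body\"' selector above suggests) and reuses the name / comment-body class names that appear in the rendered page markup used by the Selenium-based code further down; adjust the selectors if the payload differs.
import requests
from bs4 import BeautifulSoup

headers = {'X-Requested-With': 'XMLHttpRequest'}
r = requests.get(
    "https://boxes.mysubscriptionaddiction.com/get_user_reviews?box_id=105&page=1",
    headers=headers)

# undo the JSON-style escaping so the plain class names become searchable (assumption)
html = r.text.replace('\\"', '"').replace('\\n', '\n')
soup = BeautifulSoup(html, 'html.parser')

for name, body in zip(soup.find_all(class_='name'), soup.find_all(class_='comment-body')):
    print(name.get_text(strip=True), '-', body.get_text(strip=True))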
Also, I have worked out the code for your website. It uses Selenium for the button clicks and scrolling; do let me know if you have any doubts. I still suggest you go through the article first:
# -*- coding: utf-8 -*-
"""
Created on Sun Mar 8 18:09:45 2020
#author: prakharJ
"""
from selenium import webdriver
import time
import pandas as pd

names_found = []
comments_found = []
ratings_found = []
dateElements_found = []

# Web extraction of web page boxes
print("scheduled to run boxes web scraper")
driver = webdriver.Chrome(executable_path='Your/path/to/chromedriver.exe')
webpage = 'https://boxes.mysubscriptionaddiction.com/box/boxycharm?ratings=true#review-update-create'
driver.get(webpage)

SCROLL_PAUSE_TIME = 6

# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight*0.80);")
    time.sleep(SCROLL_PAUSE_TIME)
    try:
        b = driver.find_element_by_class_name('show-more-reviews')
        b.click()
        time.sleep(SCROLL_PAUSE_TIME)
    except Exception:
        s = 'no button'
    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

names_list = driver.find_elements_by_class_name('name')
comment_list = driver.find_elements_by_class_name('comment-body')
rating_list = driver.find_elements_by_xpath("//meta[@itemprop='ratingValue']")
date_list = driver.find_elements_by_class_name('comment-date')

for names in names_list:
    names_found.append(names.text)
for bodies in comment_list:
    try:
        comments_found.append(bodies.text)
    except:
        comments_found.append('NA')
for ratings in rating_list:
    try:
        ratings_found.append(ratings.get_attribute("content"))
    except:
        ratings_found.append('NA')
for dateElements in date_list:
    dateElements_found.append(dateElements.text)

# Create your temporary dataframe of the first iteration, then append that into your "final" dataframe
temp_df = pd.DataFrame({'User Name': names_found, 'Body': comments_found, 'Rating': ratings_found, 'Published Date': dateElements_found})
# df = df.append(temp_df, sort=False).reset_index(drop=True)
print('extraction completed for the day and system goes into sleep mode')
driver.quit()
I've got a Python script that scrapes the first page on an auction site. The page it's scraping is trademe.co.nz - similar to eBay/Amazon etc. Its purpose is to scrape all listings on the first page - only if they're not already in my database. It's working as expected with one caveat - it's only scraping the first 8 listings (regardless of trademe URL) & then exits with code 0 in Visual Studio Code. If I try to run it again it exits immediately, as it thinks there are no new auction IDs. If a new listing gets added & I run the script again - it will add the new one.
from bs4 import BeautifulSoup
from time import sleep
import requests
import datetime
import sqlite3

# Standard for all scrapings
dateAdded = datetime.datetime.now().strftime("%d/%m/%Y %H:%M:%S")

def mechanicalKeyboards():
    url = "https://www.trademe.co.nz/a/marketplace/computers/peripherals/keyboards/mechanical/search?condition=used&sort_order=expirydesc"
    category = "Mechanical Keyboards"
    dateAdded = datetime.datetime.now().strftime("%d/%m/%Y %H:%M:%S")
    trademeLogo = "https://www.trademe.co.nz/images/frend/trademe-logo-no-tagline.png"
    # getCode = requests.get(url).status_code
    # print(getCode)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")
    listingContainer = soup.select(".tm-marketplace-search-card__wrapper")
    conn = sqlite3.connect('trademe.db')
    c = conn.cursor()
    c.execute('''SELECT ID FROM trademe ORDER BY DateAdded DESC ''')
    allResult = str(c.fetchall())
    for listing in listingContainer:
        title = listing.select("#-title")
        location = listing.select("#-region")
        auctionID = listing['data-aria-id'].split("-").pop()
        fullListingURL = "https://www.trademe.co.nz/a/" + auctionID
        image = listing.select("picture img")
        try:
            buyNow = listing.select(".tm-marketplace-search-card__footer-pricing-row")[0].find(class_="tm-marketplace-search-card__price ng-star-inserted").text.strip()
        except:
            buyNow = "None"
        try:
            price = listing.select(".tm-marketplace-search-card__footer-pricing-row")[0].find(class_="tm-marketplace-search-card__price").text.strip()
        except:
            price = "None"
        for t, l, i in zip(title, location, image):
            if auctionID not in allResult:
                print("Adding new data - " + t.text)
                c.execute(''' INSERT INTO trademe VALUES(?,?,?,?)''', (auctionID, t.text, dateAdded, fullListingURL))
                conn.commit()
                sleep(5)
I thought perhaps I was getting rate-limited, but I get a 200 status code & changing URLs works for the first 8 listings again. I had a look at the elements & can't see any changes after the 8th listing. I'm hoping someone could assist, thanks so much.
When using requests.get(url) to scrape a website with lazy-loaded content, it only returns the HTML with images for the first 8 listings, causing the zip(title, location, image) call to only yield 8 items, since the image variable is an empty list after the 8th listing in listingContainer.
To properly scrape this type of website, I would recommend using a tool such as Playwright or Selenium.
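As a rough illustration, here is a minimal Playwright sketch (assuming playwright is installed and Chromium has been fetched with "playwright install chromium"). It renders the page and scrolls so the lazy-loaded listing images are present in the DOM before the HTML is handed to BeautifulSoup; the scroll distance and timeout are guesses, and a selector-based wait would be more robust.
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

url = "https://www.trademe.co.nz/a/marketplace/computers/peripherals/keyboards/mechanical/search?condition=used&sort_order=expirydesc"

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto(url)
    # scroll down so the lazy-loaded listing images are actually requested
    page.mouse.wheel(0, 20000)
    page.wait_for_timeout(2000)  # crude fixed wait
    html = page.content()
    browser.close()

soup = BeautifulSoup(html, "html.parser")
listingContainer = soup.select(".tm-marketplace-search-card__wrapper")
print(len(listingContainer), "listings with rendered markup")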
Hi, I am a newbie to programming. So I spent 4 days trying to learn Python. I invented some new swear words too.
I was particularly interested in trying as an exercise some web-scraping to learn something new and get some exposure to see how it all works.
This is what I came up with. See code at end. It works (to a degree)
But what's missing?
This website has pagination on it, in this case 11 pages worth. How would you go about adding to this script and getting Python to go look at those other pages too and carry out the same scrape, i.e. scrape page 1, then pages 2, 3 ... 11, and post the results to a CSV?
https://www.organicwine.com.au/vegan/?pgnum=1
https://www.organicwine.com.au/vegan/?pgnum=2
https://www.organicwine.com.au/vegan/?pgnum=3
https://www.organicwine.com.au/vegan/?pgnum=4
https://www.organicwine.com.au/vegan/?pgnum=5
https://www.organicwine.com.au/vegan/?pgnum=6
https://www.organicwine.com.au/vegan/?pgnum=7
8, 9,10, and 11
On these pages the images are actually thumbnail images, something like 251px by 251px.
How would you go about extending this script to also follow the links to the detailed product page, capture the image link from there (where the images are 1600px by 1600px), and post those links to the CSV?
https://www.organicwine.com.au/mercer-wines-preservative-free-shiraz-2020
Once we have identified those links, let's also download those larger images to a folder.
CSV writer: I also don't understand line 58,
for i in range(23)
How would I know how many products there were without counting them (i.e. there are 24 products on page one)?
So this is what I want to learn how to do. Not asking for much (he says sarcastically). I could pay someone on Upwork to do it, but where's the fun in that? And that doesn't teach me how to 'fish'.
Where is a good place to learn Python? A master class on web scraping? It seems to be trial and error, blog posts, and wherever you can pick up bits of information to piece it all together.
Maybe I need a mentor.
I wish there had been someone I could have reached out to, to tell me what BeautifulSoup was all about. I worked it out by trial and error and mostly guessing. No understanding of it, but it just works.
Anyway, any help in pulling this all together to produce a decent script would be greatly appreciated.
Hopefully there is someone out there who would not mind helping me.
Apologies to organicwine for using their website as a learning tool. I do not wish to cause any harm or be a nuisance to the site
Thank you in advance
John
code:
import requests
import csv
from bs4 import BeautifulSoup

URL = "https://www.organicwine.com.au/vegan/?pgnum=1"
response = requests.get(URL)
website_html = response.text
soup = BeautifulSoup(website_html, "html.parser")

product_title = soup.find_all('div', class_="caption")
# print(product_title)
winename = []
for wine in product_title:
    winetext = wine.a.text
    winename.append(winetext)
    print(f'''Wine Name: {winetext}''')
# print(f'''\nWine Name: {winename}\n''')

product_price = soup.find_all('div', class_='wrap-thumb-mob')
# print(product_price.text)
price = []
for wine in product_price:
    wineprice = wine.span.text
    price.append(wineprice)
    print(f'''Wine Price: {wineprice}''')
# print(f'''\nWine Price: {price}\n''')

image = []
product_image_link = soup.find_all('div', class_='thumbnail-image')
# print(product_image_link)
for imagelink in product_image_link:
    wineimagelink = imagelink.a['href']
    image.append(wineimagelink)
    # image.append(imagelink)
    print(f'''Wine Image Link: {wineimagelink}''')
# print(f'''\nWine Image: {image}\n''')

# """ writing data to CSV """
# open OrganicWine2.csv file in "write" mode
# newline stops a blank line appearing in csv
with open('OrganicWine2.csv', 'w', newline='') as file:
    # create a "writer" object
    writer = csv.writer(file, delimiter=',')
    # use "writer" obj to write
    # you should give a "list"
    writer.writerow(["Wine Name", "Wine Price", "Wine Image Link"])
    for i in range(23):
        writer.writerow([
            winename[i],
            price[i],
            image[i],
        ])
In this case, to handle pagination, instead of for i in range(1, 100), which is a hardcoded way of paging, it's better to use a while loop to dynamically paginate through all possible pages.
The while loop runs until moving to the next page is no longer possible; in this case it checks for the presence of the next-page button, which the CSS selector ".fa-chevron-right" matches:
if soup.select_one(".fa-chevron-right"):
    params["pgnum"] += 1  # go to the next page
else:
    break
To extract the full-size image, an additional request is required; the CSS selector ".main-image a" matches the full-size images:
full_image_html = requests.get(link, headers=headers, timeout=30)
image_soup = BeautifulSoup(full_image_html.text, "lxml")
try:
    original_image = f'https://www.organicwine.com.au{image_soup.select_one(".main-image a")["href"]}'
except:
    original_image = None
An additional step to avoid being blocked is to rotate user agents. Ideally, it would be better to use residential proxies combined with random user agents.
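A minimal sketch of what user-agent rotation can look like with requests (the user-agent strings below are just examples; any reasonably current set will do):
import random
import requests

user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.4 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:99.0) Gecko/20100101 Firefox/99.0",
]

# pick a different User-Agent for each request
headers = {"User-Agent": random.choice(user_agents)}
response = requests.get("https://www.organicwine.com.au/vegan/?pgnum=1", headers=headers, timeout=30)
print(response.status_code)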
pandas can be used to extract data in CSV format:
pd.DataFrame(data=data).to_csv("<csv_file_name>.csv", index=False)
For a quick and easy way to find CSS selectors, you can use the SelectorGadget Chrome extension (it does not always work perfectly if the website is rendered via JavaScript).
Check the full code with pagination and saving to CSV in the online IDE.
from bs4 import BeautifulSoup
import requests, json, lxml
import pandas as pd

# https://requests.readthedocs.io/en/latest/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36",
}

params = {
    'pgnum': 1  # number page by default
}

data = []

while True:
    page = requests.get(
        "https://www.organicwine.com.au/vegan/?",
        params=params,
        headers=headers,
        timeout=30,
    )
    soup = BeautifulSoup(page.text, "lxml")

    print(f"Extracting page: {params['pgnum']}")

    for products in soup.select(".price-btn-conts"):
        try:
            title = products.select_one(".new-h3").text
        except:
            title = None
        try:
            price = products.select_one(".price").text.strip()
        except:
            price = None
        try:
            snippet = products.select_one(".price-btn-conts p a").text
        except:
            snippet = None
        try:
            link = products.select_one(".new-h3 a")["href"]
        except:
            link = None

        # additional request is needed to extract full size image
        full_image_html = requests.get(link, headers=headers, timeout=30)
        image_soup = BeautifulSoup(full_image_html.text, "lxml")
        try:
            original_image = f'https://www.organicwine.com.au{image_soup.select_one(".main-image a")["href"]}'
        except:
            original_image = None

        data.append(
            {
                "title": title,
                "price": price,
                "snippet": snippet,
                "link": link,
                "original_image": original_image
            }
        )

    if soup.select_one(".fa-chevron-right"):
        params["pgnum"] += 1
    else:
        break

# save to CSV (install, import pandas as pd)
pd.DataFrame(data=data).to_csv("<csv_file_name>.csv", index=False)

print(json.dumps(data, indent=2, ensure_ascii=False))
Example output:
[
  {
    "title": "Yangarra McLaren Vale GSM 2016",
    "price": "$29.78 in a straight 12\nor $34.99 each",
    "snippet": "The Yangarra GSM is a careful blending of Grenache, Shiraz and Mourvèdre in which the composition varies from year to year, conveying the traditional estate blends of the southern Rhône. The backbone of the wine comes fr...",
    "link": "https://www.organicwine.com.au/yangarra-mclaren-vale-gsm-2016",
    "original_image": "https://www.organicwine.com.au/assets/full/YG_GSM_16.png?20211110083637"
  },
  {
    "title": "Yangarra Old Vine Grenache 2020",
    "price": "$37.64 in a straight 12\nor $41.99 each",
    "snippet": "Produced from the fruit of dry grown bush vines planted high up in the Estate's elevated vineyards in deep sandy soils. These venerated vines date from 1946 and produce a wine that is complex, perfumed and elegant with a...",
    "link": "https://www.organicwine.com.au/yangarra-old-vine-grenache-2020",
    "original_image": "https://www.organicwine.com.au/assets/full/YG_GRE_20.jpg?20210710165951"
  },
  # ...
]
Create the URL by putting the page number in it, then put the rest of your code into a for loop; you can use len(winename) to count how many results you have. You should do the writing outside the for loop. Here's your code with those changes:
import requests
import csv
from bs4 import BeautifulSoup

num_pages = 11

result = []
for pgnum in range(num_pages):
    url = f"https://www.organicwine.com.au/vegan/?pgnum={pgnum+1}"
    response = requests.get(url)
    website_html = response.text
    soup = BeautifulSoup(website_html, "html.parser")

    product_title = soup.find_all("div", class_="caption")
    winename = []
    for wine in product_title:
        winetext = wine.a.text
        winename.append(winetext)

    product_price = soup.find_all("div", class_="wrap-thumb-mob")
    price = []
    for wine in product_price:
        wineprice = wine.span.text
        price.append(wineprice)

    image = []
    product_image_link = soup.find_all("div", class_="thumbnail-image")
    for imagelink in product_image_link:
        winelink = imagelink.a["href"]
        response = requests.get(winelink)
        wine_page_soup = BeautifulSoup(response.text, "html.parser")
        main_image = wine_page_soup.find("a", class_="fancybox")
        image.append(main_image['href'])

    for i in range(len(winename)):
        result.append([winename[i], price[i], image[i]])

with open("/tmp/OrganicWine2.csv", "w", newline="") as file:
    writer = csv.writer(file, delimiter=",")
    writer.writerow(["Wine Name", "Wine Price", "Wine Image Link"])
    writer.writerows(result)
And here's how I would rewrite your code to accomplish this task. It's more pythonic (you should basically never write range(len(something)), there's always a cleaner way) and it doesn't require knowing how many pages of results there are:
import csv
import itertools
import time

import requests
from bs4 import BeautifulSoup

data = []
# Try opening 100 pages at most, in case the scraping code is broken
# which can happen because websites change.
for pgnum in range(1, 100):
    url = f"https://www.organicwine.com.au/vegan/?pgnum={pgnum}"
    response = requests.get(url)
    website_html = response.text
    soup = BeautifulSoup(website_html, "html.parser")

    search_results = soup.find_all("div", class_="thumbnail")

    for search_result in search_results:
        name = search_result.find("div", class_="caption").a.text
        price = search_result.find("p", class_="price").span.text

        # link to the product's page
        link = search_result.find("div", class_="thumbnail-image").a["href"]

        # get the full resolution product image
        response = requests.get(link)
        time.sleep(1)  # rate limit
        wine_page_soup = BeautifulSoup(response.text, "html.parser")
        main_image = wine_page_soup.find("a", class_="fancybox")
        image_url = main_image["href"]

        # or you can just "guess" it from the thumbnail's URL
        # thumbnail = search_result.find("div", class_="thumbnail-image").a.img['src']
        # image_url = thumbnail.replace('/thumbL/', '/full/')

        data.append([name, price, link, image_url])

    # if there's no "next page" button or no search results on the current page,
    # stop scraping
    if not soup.find("i", class_="fa-chevron-right") or not search_results:
        break

    # rate limit
    time.sleep(1)

with open("/tmp/OrganicWine3.csv", "w", newline="") as file:
    writer = csv.writer(file, delimiter=",")
    writer.writerow(["Wine Name", "Wine Price", "Wine Link", "Wine Image Link"])
    writer.writerows(data)
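As a standalone illustration of the range(len(...)) remark above, the indexing loop from the first version can be replaced with zip, which pairs the parallel lists directly (the sample values here are made up):
# made-up sample data standing in for the scraped lists
winename = ["Wine A", "Wine B"]
price = ["$10.00", "$12.00"]
image = ["https://example.com/a.jpg", "https://example.com/b.jpg"]

result = []
for name, wine_price, image_link in zip(winename, price, image):
    result.append([name, wine_price, image_link])

print(result)  # [['Wine A', '$10.00', ...], ['Wine B', '$12.00', ...]]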
I have several URLs which link to hotel pages and I would like to scrape some data from them.
I'm using the following script, but I would like to update it:
data = []
for i in range(0, 10):
    url = final_list[i]
    driver2 = webdriver.Chrome()
    driver2.get(url)
    sleep(randint(10, 20))
    soup = BeautifulSoup(driver2.page_source, 'html.parser')
    my_table2 = soup.find_all(class_=['title-2', 'rating-score body-3'])
    review = soup.find_all(class_='reviews')[-1]
    try:
        price = soup.find_all('span', attrs={'class': 'price'})[-1]
    except:
        price = soup.find_all('span', attrs={'class': 'price'})
    for tag in my_table2:
        data.append(tag.text.strip())
    for p in price:
        data.append(p)
    for r in review:
        data.append(r)
But here's the problem: tag.text.strip() scrapes the rating numbers one by one, but not all hotels have the same number of ratings. Here's a hotel with 7 ratings, while the default number is 8. Some have seven ratings, others six, and so on. So in the end, my dataframe is quite screwed up: if the hotel doesn't have 8 ratings, the values are shifted.
My question is: how do I tell the script "if there is a value for this rating, put the value, but if there isn't, put None", and of course do that for all eight values?
I tried several things like :
for tag in my_table2:
    for i in tag.text.strip()[i]:
        if i:
            data.append(i)
        else:
            data.append(None)
But unfortunately, that goes nowhere, so if you could help to figure out the answer, it would be awesome :)
If that could help you, I put link on Hotel that I'm scraping :
https://www.hostelworld.com/pwa/hosteldetails.php/Itaca-Hostel/Barcelona/1279?from=2020-11-21&to=2020-11-22&guests=1
The number ratings are at the end
Thank you.
A few suggestions:
Put your data in a dictionary. That way you don't have to assume that all tags are present, and the order of the tags doesn't matter. You can get the labels and the corresponding ratings with
rating_labels = soup.find_all(class_=['rating-label body-3'])
rating_scores = soup.find_all(class_=['rating-score body-3'])
and then iterate over both lists with zip.
Move your driver outside of the loop; opening it once is enough.
Don't use a fixed sleep; use Selenium's wait functions instead. You can wait for a particular element to be present or populated with WebDriverWait(driver, 10).until(EC.presence_of_element_located(your_element)) (see the short sketch after this list).
https://selenium-python.readthedocs.io/waits.html
Cache your scraped HTML code to a file. It's faster for you and politer to the website you are scraping
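Here is a minimal sketch of the explicit-wait idea (the class name rating-score is an assumption based on the selectors used in this question; adjust it to whatever element signals that the page is ready):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.hostelworld.com/pwa/hosteldetails.php/Itaca-Hostel/Barcelona/1279?from=2020-11-21&to=2020-11-22&guests=1")

# block until at least one rating score is rendered, or raise TimeoutException after 10 s
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "rating-score"))  # assumed class name
)
source = driver.page_source
driver.quit()
The full version with caching and dictionaries is below: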
import selenium
import selenium.webdriver
import time
import random
import os
from bs4 import BeautifulSoup

data = []

final_list = [
    'https://www.hostelworld.com/pwa/hosteldetails.php/Itaca-Hostel/Barcelona/1279?from=2020-11-21&to=2020-11-22&guests=1',
    'https://www.hostelworld.com/pwa/hosteldetails.php/Be-Ramblas-Hostel/Barcelona/435?from=2020-11-27&to=2020-11-28&guests=1'
]

# load your driver only once to save time
driver = selenium.webdriver.Chrome()

for url in final_list:
    data.append({})

    # cache the HTML code to the filesystem
    # generate a filename from the URL where all non-alphanumeric characters (e.g. :/) are replaced with underscores _
    filename = ''.join([s if s.isalnum() else '_' for s in url])
    if not os.path.isfile(filename):
        driver.get(url)
        # better use selenium's wait functions here
        time.sleep(random.randint(10, 20))
        source = driver.page_source
        with open(filename, 'w', encoding='utf-8') as f:
            f.write(source)
    else:
        with open(filename, 'r', encoding='utf-8') as f:
            source = f.read()

    soup = BeautifulSoup(source, 'html.parser')

    review = soup.find_all(class_='reviews')[-1]
    try:
        price = soup.find_all('span', attrs={'class': 'price'})[-1]
    except:
        price = soup.find_all('span', attrs={'class': 'price'})

    data[-1]['name'] = soup.find_all(class_=['title-2'])[0].text.strip()

    rating_labels = soup.find_all(class_=['rating-label body-3'])
    rating_scores = soup.find_all(class_=['rating-score body-3'])
    assert len(rating_labels) == len(rating_scores)
    for label, score in zip(rating_labels, rating_scores):
        data[-1][label.text.strip()] = score.text.strip()

    data[-1]['price'] = price.text.strip()
    data[-1]['review'] = review.text.strip()
The data can then be easily put in a nicely formatted table using Pandas
import pandas as pd
df = pd.DataFrame(data)
df
If some data is missing/incomplete, Pandas will replace it with 'NaN'
data.append(data[0].copy())
del(data[-1]['Staff'])
data[-1]['name'] = 'Incomplete Hostel'
pd.DataFrame(data)
I'm trying to scrape https://arxiv.org/search/?query=healthcare&searchtype=all through Selenium and Python. The for loop takes too long to execute. I tried to scrape with headless browsers and PhantomJS, but they don't scrape the abstract field (I need the abstract field expanded, with the "More" button clicked).
import pandas as pd
import selenium
import re
import time
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver import Firefox

browser = Firefox()
url_healthcare = 'https://arxiv.org/search/?query=healthcare&searchtype=all'
browser.get(url_healthcare)

dfs = []
for i in range(1, 39):
    articles = browser.find_elements_by_tag_name('li[class="arxiv-result"]')
    for article in articles:
        title = article.find_element_by_tag_name('p[class="title is-5 mathjax"]').text
        arxiv_id = article.find_element_by_tag_name('a').text.replace('arXiv:', '')
        arxiv_link = article.find_elements_by_tag_name('a')[0].get_attribute('href')
        pdf_link = article.find_elements_by_tag_name('a')[1].get_attribute('href')
        authors = article.find_element_by_tag_name('p[class="authors"]').text.replace('Authors:', '')
        try:
            link1 = browser.find_element_by_link_text('▽ More')
            link1.click()
        except:
            time.sleep(0.1)
        abstract = article.find_element_by_tag_name('p[class="abstract mathjax"]').text
        date = article.find_element_by_tag_name('p[class="is-size-7"]').text
        date = re.split(r"Submitted|;", date)[1]
        tag = article.find_element_by_tag_name('div[class="tags is-inline-block"]').text.replace('\n', ',')
        try:
            doi = article.find_element_by_tag_name('div[class="tags has-addons"]').text
            doi = re.split(r'\s', doi)[1]
        except NoSuchElementException:
            doi = 'None'
        all_combined = [title, arxiv_id, arxiv_link, pdf_link, authors, abstract, date, tag, doi]
        dfs.append(all_combined)
    print('Finished Extracting Page:', i)
    try:
        link2 = browser.find_element_by_class_name('pagination-next')
        link2.click()
    except:
        browser.close()
    time.sleep(0.1)
The following implementation achieves this in 16 seconds.
To speed up the execution process I have taken the following measures:
Removed Selenium entirely (No clicking required)
For abstract, used BeautifulSoup's output and processed it later
Added multiprocessing to speed up the process significantly
from multiprocessing import Process, Manager
import requests
from bs4 import BeautifulSoup
import re
import time

start_time = time.time()


def get_no_of_pages(showing_text):
    no_of_results = int((re.findall(r"(\d+,*\d+) results for all", showing_text)[0].replace(',', '')))
    pages = no_of_results // 200 + 1
    print("total pages:", pages)
    return pages


def clean(text):
    return text.replace("\n", '').replace(" ", '')


def get_data_from_page(url, page_number, data):
    print("getting page", page_number)
    response = requests.get(url + "start=" + str(page_number * 200))
    soup = BeautifulSoup(response.content, "lxml")

    arxiv_results = soup.find_all("li", {"class", "arxiv-result"})

    for arxiv_result in arxiv_results:
        paper = {}
        paper["titles"] = clean(arxiv_result.find("p", {"class", "title is-5 mathjax"}).text)
        links = arxiv_result.find_all("a")
        paper["arxiv_ids"] = links[0].text.replace('arXiv:', '')
        paper["arxiv_links"] = links[0].get('href')
        paper["pdf_link"] = links[1].get('href')
        paper["authors"] = clean(arxiv_result.find("p", {"class", "authors"}).text.replace('Authors:', ''))

        split_abstract = arxiv_result.find("p", {"class": "abstract mathjax"}).text.split("▽ More\n\n\n", 1)
        if len(split_abstract) == 2:
            paper["abstract"] = clean(split_abstract[1].replace("△ Less", ''))
        else:
            paper["abstract"] = clean(split_abstract[0].replace("△ Less", ''))

        paper["date"] = re.split(r"Submitted|;", arxiv_result.find("p", {"class": "is-size-7"}).text)[1]
        paper["tag"] = clean(arxiv_result.find("div", {"class": "tags is-inline-block"}).text)
        doi = arxiv_result.find("div", {"class": "tags has-addons"})
        if doi is None:
            paper["doi"] = "None"
        else:
            paper["doi"] = re.split(r'\s', doi.text)[1]

        data.append(paper)

    print(f"page {page_number} done")


if __name__ == "__main__":
    url = 'https://arxiv.org/search/?searchtype=all&query=healthcare&abstracts=show&size=200&order=-announced_date_first&'
    response = requests.get(url + "start=0")
    soup = BeautifulSoup(response.content, "lxml")

    with Manager() as manager:
        data = manager.list()
        processes = []

        get_data_from_page(url, 0, data)

        showing_text = soup.find("h1", {"class": "title is-clearfix"}).text
        for i in range(1, get_no_of_pages(showing_text)):
            p = Process(target=get_data_from_page, args=(url, i, data))
            p.start()
            processes.append(p)

        for p in processes:
            p.join()

        print("Number of entries scraped:", len(data))

        stop_time = time.time()
        print("Time taken:", stop_time - start_time, "seconds")
Output:
>>> python test.py
getting page 0
page 0 done
total pages: 10
getting page 1
getting page 4
getting page 2
getting page 6
getting page 5
getting page 3
getting page 7
getting page 9
getting page 8
page 9 done
page 4 done
page 1 done
page 6 done
page 2 done
page 7 done
page 3 done
page 5 done
page 8 done
Number of entries scraped: 1890
Time taken: 15.911492586135864 seconds
Note:
Please write the above code in a .py file. For a Jupyter notebook, refer to this.
Multiprocessing code taken from here.
The ordering of entries in the data list won't match the ordering on the website as Manager will append dictionaries into it as they come.
The above code finds the number of pages on its own and is thus generalized to work on any arXiv search result. Unfortunately, to do this it first fetches page 0, then calculates the number of pages, and only then starts multiprocessing for the remaining pages. This has the disadvantage that while the 0th page is being worked on, no other process is running. So if you remove that part and simply run the loop over all 10 pages, the time taken should fall to around 8 seconds; a sketch of that variant follows.
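For reference, a sketch of that variant (it reuses get_data_from_page from the code above and assumes the page count, 10 in the example run, is known or hardcoded up front):
from multiprocessing import Process, Manager

if __name__ == "__main__":
    url = 'https://arxiv.org/search/?searchtype=all&query=healthcare&abstracts=show&size=200&order=-announced_date_first&'
    with Manager() as manager:
        data = manager.list()
        processes = []
        for i in range(10):  # assumed/hardcoded page count, including page 0
            p = Process(target=get_data_from_page, args=(url, i, data))
            p.start()
            processes.append(p)
        for p in processes:
            p.join()
        print("Number of entries scraped:", len(data))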
You can try a requests and BeautifulSoup approach; there is no need to click the "More" link.
from requests import get
from bs4 import BeautifulSoup

# you can change the size to retrieve all the results in one shot
url = 'https://arxiv.org/search/?query=healthcare&searchtype=all&abstracts=show&order=-announced_date_first&size=50&start=0'
response = get(url, verify=False)
soup = BeautifulSoup(response.content, "lxml")
# print(soup)
queryresults = soup.find_all("li", attrs={"class": "arxiv-result"})

for result in queryresults:
    title = result.find("p", attrs={"class": "title is-5 mathjax"})
    print(title.text)

# If you need the full abstract content, try this (you do not need to click on the "More" button)
for result in queryresults:
    abstractFullContent = result.find("span", attrs={"class": "abstract-full has-text-grey-dark mathjax"})
    print(abstractFullContent.text)
Output:
Interpretable Deep Learning for Automatic Diagnosis of 12-lead Electrocardiogram
Leveraging Technology for Healthcare and Retaining Access to Personal Health Data to Enhance Personal Health and Well-being
Towards new forms of particle sensing and manipulation and 3D imaging on a smartphone for healthcare applications
I am trying to scrape data from the PGA.com website to get a table of all of the golf courses in the United States. In my CSV table I want to include the name of the golf course, address, ownership, website, and phone number. With this data I would like to geocode it, place it into a map, and have a local copy on my computer.
I utilized Python and BeautifulSoup4 to extract my data. I have gotten as far as extracting the data and importing it into a CSV, but I am now having a problem scraping data from multiple pages on the PGA website. I want to extract ALL THE GOLF COURSES, but my script is limited to only one page. I want to loop it in a way that it will capture all data for golf courses from all pages found on the PGA site. There are about 18000 golf courses and 900 pages to capture data from.
Attached below is my script. I need help creating code that will capture ALL data from the PGA website and not just one page but multiple. In this manner it will provide me with all the data on golf courses in the United States.
Here is my script below:
import csv
import requests
from bs4 import BeautifulSoup

url = "http://www.pga.com/golf-courses/search?searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0"

r = requests.get(url)
soup = BeautifulSoup(r.content)

g_data1 = soup.find_all("div", {"class": "views-field-nothing-1"})
g_data2 = soup.find_all("div", {"class": "views-field-nothing"})

courses_list = []

for item in g_data2:
    try:
        name = item.contents[1].find_all("div", {"class": "views-field-title"})[0].text
    except:
        name = ''
    try:
        address1 = item.contents[1].find_all("div", {"class": "views-field-address"})[0].text
    except:
        address1 = ''
    try:
        address2 = item.contents[1].find_all("div", {"class": "views-field-city-state-zip"})[0].text
    except:
        address2 = ''
    try:
        website = item.contents[1].find_all("div", {"class": "views-field-website"})[0].text
    except:
        website = ''
    try:
        Phonenumber = item.contents[1].find_all("div", {"class": "views-field-work-phone"})[0].text
    except:
        Phonenumber = ''

    course = [name, address1, address2, website, Phonenumber]
    courses_list.append(course)

with open('filename5.csv', 'wb') as file:
    writer = csv.writer(file)
    for row in courses_list:
        writer.writerow(row)

# for item in g_data1:
#     try:
#         print item.contents[1].find_all("div", {"class": "views-field-counter"})[0].text
#     except:
#         pass
#     try:
#         print item.contents[1].find_all("div", {"class": "views-field-course-type"})[0].text
#     except:
#         pass

# for item in g_data2:
#     try:
#         print item.contents[1].find_all("div", {"class": "views-field-title"})[0].text
#     except:
#         pass
#     try:
#         print item.contents[1].find_all("div", {"class": "views-field-address"})[0].text
#     except:
#         pass
#     try:
#         print item.contents[1].find_all("div", {"class": "views-field-city-state-zip"})[0].text
#     except:
#         pass
This script only captures 20 courses at a time and I want to capture all of them in one script, which accounts for 18000 golf courses and 900 pages to scrape from.
The PGA website's search has multiple pages; the URL follows the pattern:
http://www.pga.com/golf-courses/search?page=1 # Additional info after page parameter here
This means you can read the content of the page, then change the value of page by 1, read the next page... and so on.
import csv
import requests
from bs4 import BeautifulSoup

for i in range(907):  # Number of pages plus one
    url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content)

    # Your code for each individual page here
If you are still reading this post, you can try this code too...
from urllib.request import urlopen
from bs4 import BeautifulSoup

file = "Details.csv"
f = open(file, "w")
Headers = "Name,Address,City,Phone,Website\n"
f.write(Headers)
for page in range(1, 5):
    url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course%20Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(page)
    html = urlopen(url)
    soup = BeautifulSoup(html, "html.parser")
    Title = soup.find_all("div", {"class": "views-field-nothing"})
    for i in Title:
        try:
            name = i.find("div", {"class": "views-field-title"}).get_text()
            address = i.find("div", {"class": "views-field-address"}).get_text()
            city = i.find("div", {"class": "views-field-city-state-zip"}).get_text()
            phone = i.find("div", {"class": "views-field-work-phone"}).get_text()
            website = i.find("div", {"class": "views-field-website"}).get_text()
            print(name, address, city, phone, website)
            f.write("{}".format(name).replace(",", "|") + ",{}".format(address) + ",{}".format(city).replace(",", " ") + ",{}".format(phone) + ",{}".format(website) + "\n")
        except AttributeError:
            pass
f.close()
Where it is written range(1,5), just change that to run from 0 to the last page and you will get all the details in the CSV. I tried very hard to get your data into a proper format, but it's hard :).
You're pointing the script at a single page; it's not going to iterate through each one on its own.
Page 1:
url = "http://www.pga.com/golf-courses/search?searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0"
Page 2:
http://www.pga.com/golf-courses/search?page=1&searchbox=Course%20Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0
Page 907:
http://www.pga.com/golf-courses/search?page=906&searchbox=Course%20Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0
Since you're only requesting page 1, you'll only get 20 results. You'll need to create a loop that runs through each page.
You can start off by creating a function that scrapes one page, then iterate that function (a rough sketch follows below).
Right after the search? in the URL, starting at page 2, the page parameter appears as page=1 and keeps increasing until page 907, where it's page=906.
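A rough sketch of that structure (the helper name scrape_page is hypothetical; the selectors match the ones already used in the question):
import requests
from bs4 import BeautifulSoup

def scrape_page(page_number):
    """Fetch one search results page and return its course containers."""
    url = ("http://www.pga.com/golf-courses/search?page={}&searchbox=Course%20Name"
           "&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0").format(page_number)
    r = requests.get(url)
    return BeautifulSoup(r.content, "html.parser").find_all("div", {"class": "views-field-nothing"})

all_items = []
for page in range(907):  # adjust the start index if page=0 duplicates the first page
    all_items.extend(scrape_page(page))
print(len(all_items), "course entries collected")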
I noticed that the first solution had a repetition of the first page; that is because page 0 and page 1 are the same page. This is resolved by specifying the start page in the range function. Example below...
for i in range(1, 907):  # Number of pages plus one
    url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html5lib")  # Can use whichever parser you prefer

    # Your code for each individual page here
I had this same exact problem and the solutions above did not work. I solved mine by accounting for cookies. A requests session helps: create a session and it'll pull all the pages you need by sending the cookie with every numbered-page request.
import csv
import requests
from bs4 import BeautifulSoup
url = "http://www.pga.com/golf-courses/search?searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0"
s = requests.Session()
r = s.get(url)
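Continuing that idea, a sketch of reusing the same session s for every numbered page so the cookie from the first request is carried along (the parsing line is just a placeholder; plug in the per-course extraction from the question):
courses_list = []
for i in range(907):  # number of result pages
    page_url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
    r = s.get(page_url)  # the session carries the cookie set by the first request
    soup = BeautifulSoup(r.content, "html.parser")
    for item in soup.find_all("div", {"class": "views-field-nothing"}):
        courses_list.append(item.get_text(" ", strip=True))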
The PGA website has changed since this question was asked.
It seems they organize all courses by: State > City > Course
In light of this change and the popularity of this question, here's how I'd solve this problem today.
Step 1 - Import everything we'll need:
import time
import random
from gazpacho import Soup # https://github.com/maxhumber/gazpacho
from tqdm import tqdm # to keep track of progress
Step 2 - Scrape all the state URL endpoints:
URL = "https://www.pga.com"
def get_state_urls():
soup = Soup.get(URL + "/play")
a_tags = soup.find("ul", {"data-cy": "states"}, mode="first").find("a")
state_urls = [URL + a.attrs['href'] for a in a_tags]
return state_urls
state_urls = get_state_urls()
Step 3 - Write a function to scrape all the city links:
def get_state_cities(state_url):
    soup = Soup.get(state_url)
    a_tags = soup.find("ul", {"data-cy": "city-list"}).find("a")
    state_cities = [URL + a.attrs['href'] for a in a_tags]
    return state_cities

state_url = state_urls[0]
city_links = get_state_cities(state_url)
Step 4 - Write a function to scrape all of the courses:
def get_courses(city_link):
    soup = Soup.get(city_link)
    courses = soup.find("div", {"class": "MuiGrid-root MuiGrid-item MuiGrid-grid-xs-12 MuiGrid-grid-md-6"}, mode="all")
    return courses

city_link = city_links[0]
courses = get_courses(city_link)
Step 5 - Write a function to parse all the useful info about a course:
def parse_course(course):
    return {
        "name": course.find("h5", mode="first").text,
        "address": course.find("div", {'class': "jss332"}, mode="first").strip(),
        "url": course.find("a", mode="first").attrs["href"]
    }

course = courses[0]
parse_course(course)
Step 6 - Loop through everything and save:
all_courses = []
for state_url in tqdm(state_urls):
    city_links = get_state_cities(state_url)
    time.sleep(random.uniform(1, 10) / 10)
    for city_link in city_links:
        courses = get_courses(city_link)
        time.sleep(random.uniform(1, 10) / 10)
        for course in courses:
            info = parse_course(course)
            all_courses.append(info)
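And if you want the CSV the question asks for, a minimal sketch of writing all_courses out with pandas (pandas is not used in the steps above, so this is an addition):
import pandas as pd

# all_courses is the list of dicts built in the loop above (keys: name, address, url)
pd.DataFrame(all_courses).to_csv("pga_courses.csv", index=False)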