I am trying to scrape some ETF stock information from https://etfdb.com/etfs/sector/technology/#etfs&sort_name=assets_under_management&sort_order=desc&page=1 as a personal project.
What I am trying to do is scrape the table shown on each of the pages, but it always returns the same values even though I update the page number in the URL. Is there some limitation or something about the webpage that I am not considering? What can I do to scrape the tables from pages 1 through 5 of the above link?
The code that I am trying to use is as follows:
import pandas as pd
import requests

def etf_table_scraper(industry):
    # instantiate empty dataframe
    df = pd.DataFrame()

    # cycle through the pages
    for page in range(1, 10):
        url = f"https://etfdb.com/etfs/sector/{industry}/#etfs__returns&sort_name=symbol&sort_order=asc&page={page}"
        r = requests.get(url)
        df_list = pd.read_html(r.text)  # this parses all the tables in the webpage to a list
        # if first page, append
        if page == 1:
            df = df.append(df_list[0].iloc[:-1])
        # otherwise check to see if there are overlaps
        elif df_list[0].loc[0, 'Symbol'] not in df['Symbol'].unique():
            df = df.append(df_list[0].iloc[:-1])
        else:
            break

    return df
So I saw the same issue as you when using requests. The part of the URL after the # is a fragment, which is handled client-side and never sent to the server, so requests keeps fetching page 1 no matter which page number you put there. I was able to work around this using Selenium and clicking the next page button. Here's some sample code; you'd need to rework it into your own flow, as this was just used for testing.
from selenium import webdriver
from time import sleep
import random
import pandas as pd

df = pd.DataFrame()
driver = webdriver.Chrome(executable_path=r"C:\chromedriver_win32\chromedriver.exe")  # Add your own path here
driver.get("https://etfdb.com/etfs/sector/technology/#etfs&sort_name=assets_under_management&sort_order=desc&page=1")
sleep(2)
text = driver.page_source  # Get page source to parse the table
table_pg1 = pd.read_html(text)[0].iloc[:-1]
df = df.append(table_pg1)
sleep(2)
for i in range(1, 4):
    # Click the next page button
    driver.find_element_by_xpath('//*[@id="featured-wrapper"]/div[1]/div[4]/div[1]/div[2]/div[2]/div[2]/div[4]/div[2]/ul/li[8]/a').click()
    sleep(3)
    text = driver.page_source
    table_pg_i = pd.read_html(text)[0].iloc[:-1]
    df = df.append(table_pg_i)
driver.close()
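If it helps, here is a rough sketch of how the same idea could be folded back into your etf_table_scraper(industry) signature. The next-page XPath is just copied from the snippet above, so treat it as an assumption that may break if the page layout changes, and the chromedriver path is a placeholder:

import pandas as pd
from selenium import webdriver
from time import sleep

def etf_table_scraper_selenium(industry, pages=5):
    driver = webdriver.Chrome(executable_path=r"C:\chromedriver_win32\chromedriver.exe")  # placeholder path
    driver.get(f"https://etfdb.com/etfs/sector/{industry}/")
    sleep(2)
    frames = []
    for page in range(pages):
        # Read the currently rendered table and drop the footer row
        frames.append(pd.read_html(driver.page_source)[0].iloc[:-1])
        if page < pages - 1:
            # Next-page link; XPath copied from the test snippet above
            driver.find_element_by_xpath('//*[@id="featured-wrapper"]/div[1]/div[4]/div[1]/div[2]/div[2]/div[2]/div[4]/div[2]/ul/li[8]/a').click()
            sleep(3)
    driver.quit()
    return pd.concat(frames, ignore_index=True)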
import pandas as pd
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium import webdriver

for c in range(1, 20):
    linkall = f'https://www.coingecko.com/?page={c}'
    driver = webdriver.Chrome()
    driver.get(linkall)
    elem = driver.find_element(By.TAG_NAME, 'table')
    head = elem.find_element(By.TAG_NAME, 'thead')
    body = elem.find_element(By.TAG_NAME, 'tbody')
    list_rows = []
    for items in body.find_elements(By.TAG_NAME, 'tr'):
        list_cells = []
        for item in items.find_elements(By.TAG_NAME, 'td'):
            list_cells.append(item.text)
        list_rows.append(list_cells)
    driver.close()
    print(list_rows)
    print(len(list_rows))
df = pd.DataFrame(list_rows)
df.to_csv('coingeckod.csv')
Here I want to scrape table data from multiple web pages with Python and Selenium, but I only get the last page's table data. There are 19 web pages, and each page has 100 rows, so 19 * 100 = 1900 rows, yet I only get the 100 rows of the last page. While scraping, the data from all the pages is printed in the terminal, but only the last page is saved to my CSV file. I want to save all 19 pages of data, in order, in the CSV file. Where did my code go wrong? Did I make a mistake with the append? Please correct me so that I get all 1900 rows of the 19 pages in order.
The list list_rows = [] should be declared before the outer for loop. You have declared it inside the outer for loop, so on each iteration it is reset to an empty list, and that is why you only get the final page's values.
Code block:
list_rows = []
for c in range(1, 20):
    linkall = f'https://www.coingecko.com/?page={c}'
    driver = webdriver.Chrome()
    driver.get(linkall)
    elem = driver.find_element(By.TAG_NAME, 'table')
    head = elem.find_element(By.TAG_NAME, 'thead')
    body = elem.find_element(By.TAG_NAME, 'tbody')
    for items in body.find_elements(By.TAG_NAME, 'tr'):
        list_cells = []
        for item in items.find_elements(By.TAG_NAME, 'td'):
            list_cells.append(item.text)
        list_rows.append(list_cells)
    driver.close()
print(list_rows)
print(len(list_rows))
df = pd.DataFrame(list_rows)
df.to_csv('coingeckod.csv')
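As a side note, you could also start the browser once before the loop and quit it once at the end, instead of launching a new Chrome instance for every page; a minimal sketch of that variation:

from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd

list_rows = []
driver = webdriver.Chrome()  # one browser for all pages
for c in range(1, 20):
    driver.get(f'https://www.coingecko.com/?page={c}')
    table = driver.find_element(By.TAG_NAME, 'table')
    body = table.find_element(By.TAG_NAME, 'tbody')
    for row in body.find_elements(By.TAG_NAME, 'tr'):
        list_rows.append([cell.text for cell in row.find_elements(By.TAG_NAME, 'td')])
driver.quit()  # close the browser once at the end
pd.DataFrame(list_rows).to_csv('coingeckod.csv')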
My code goes to a website and clicks on records, which causes drop-downs to appear.
My current code only prints the first drop-down record, and not the others.
For example, when the first record of the webpage is clicked, it drops down one record (shown attached). That is also the first and only drop-down record that gets printed in my output.
The code prints this
How do I get it to pull all drop down titles?
from selenium import webdriver
import time

driver = webdriver.Chrome()
for x in range(1, 2):
    driver.get(f'https://library.iaslc.org/conference-program?product_id=24&author=&category=&date=&session_type=&session=&presentation=&keyword=&available=&cme=&page={x}')
    time.sleep(4)
    productlist_length = len(driver.find_elements_by_xpath("//div[@class='accordin_title']"))
    for i in range(1, productlist_length + 1):
        product = driver.find_element_by_xpath("(//div[@class='accordin_title'])[" + str(i) + "]")
        title = product.find_element_by_xpath('.//h4').text.strip()
        print(title)
        buttonToClick = product.find_element_by_xpath('.//div[@class="sign"]')
        buttonToClick.click()
        time.sleep(5)
        subProduct = driver.find_element_by_xpath(".//li[@class='sub_accordin_presentation']")
        otherTitle = subProduct.find_element_by_xpath('.//h4').text.strip()
        print(otherTitle)
You don't need Selenium at all. I'm not sure exactly which info you are after, but the following shows that the content inside those expand blocks is already available in the response from a simple requests.get():
import requests
from bs4 import BeautifulSoup as bs
import re

r = requests.get('https://library.iaslc.org/conference-program?product_id=24&author=&category=&date=&session_type=&session=&presentation=&keyword=&available=&cme=&page=1')
soup = bs(r.text, 'lxml')
sessions = soup.select('#accordin > ul > li')
for session in sessions:
    print(session.select_one('h4').text)
    sub_session = session.select('.sub_accordin_presentation')
    if sub_session:
        print([re.sub(r'[\n\s]+', ' ', i.text) for i in sub_session])
    print()
    print()
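Since the page number is an ordinary query parameter in that URL, the same parsing can be wrapped in a loop if you want more than the first page; a small sketch (the page range is an assumption you would adjust):

import requests
from bs4 import BeautifulSoup as bs

for page in range(1, 4):  # example range only
    r = requests.get(f'https://library.iaslc.org/conference-program?product_id=24&author=&category=&date=&session_type=&session=&presentation=&keyword=&available=&cme=&page={page}')
    soup = bs(r.text, 'lxml')
    for session in soup.select('#accordin > ul > li'):
        print(session.select_one('h4').text)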
Try:

productlist = driver.find_elements_by_xpath('//*[@class="jscroll-inner"]/ul/li')
for product in productlist:
    title = product.find_element_by_xpath('(.//*[@class="accordin_title"]/div)[3]/h4').text
I'm currently using this code to web scrape reviews from TrustPilot. I wish to adjust the code to scrape reviews from (https://boxes.mysubscriptionaddiction.com/box/boxycharm?ratings=true#review-update-create). However, unlike most other review sites, the reviews are not separated into multiple sub-pages but there is instead a button at the end of the page to "view more reviews" which shows 3 additional reviews whenever you press it.
Is it possible to adjust the code such that it is able to scrape all the reviews from this particular product within the website with this kind of web structure?
from bs4 import BeautifulSoup
import requests
import pandas as pd
import json

print('all imported successfully')

# Initialize an empty dataframe
df = pd.DataFrame()
for x in range(1, 44):
    names = []
    headers = []
    bodies = []
    ratings = []
    published = []
    updated = []
    reported = []

    link = f'https://www.trustpilot.com/review/birchbox.com?page={x}'
    print(link)
    req = requests.get(link)
    content = req.content
    soup = BeautifulSoup(content, "lxml")
    articles = soup.find_all('article', {'class': 'review'})

    for article in articles:
        names.append(article.find('div', attrs={'class': 'consumer-information__name'}).text.strip())
        headers.append(article.find('h2', attrs={'class': 'review-content__title'}).text.strip())
        try:
            bodies.append(article.find('p', attrs={'class': 'review-content__text'}).text.strip())
        except:
            bodies.append('')
        try:
            #ratings.append(article.find('div', attrs={'class': 'star-rating star-rating--medium'}).text.strip())
            #ratings.append(article.find('div', attrs={'class': 'star-rating star-rating--medium'})['alt'])
            ratings.append(article.find_all("img", alt=True)[0]["alt"])
        except:
            ratings.append('')
        dateElements = article.find('div', attrs={'class': 'review-content-header__dates'}).text.strip()
        jsonData = json.loads(dateElements)
        published.append(jsonData['publishedDate'])
        updated.append(jsonData['updatedDate'])
        reported.append(jsonData['reportedDate'])

    # Create your temporary dataframe of the current iteration, then append that into your "final" dataframe
    temp_df = pd.DataFrame({'User Name': names, 'Header': headers, 'Body': bodies, 'Rating': ratings, 'Published Date': published, 'Updated Date': updated, 'Reported Date': reported})
    df = df.append(temp_df, sort=False).reset_index(drop=True)
    print('pass1')

df.to_csv('BirchboxReviews2.0.csv', index=False, encoding='utf-8')
print('excel done')
Basically you are dealing with a website that is dynamically loaded via JavaScript once the page loads, and the comments are rendered with JS code on each scroll down.
I was able to locate the XHR request that fetches the comments via JS, and I was able to call it and retrieve all the comments you asked for.
You don't need to use Selenium, as it would only slow down your task.
Here is how you can achieve your target, assuming that each page includes 3 comments, so we just do the math to cover all the pages.
import requests
from bs4 import BeautifulSoup
import math


def PageNum():
    r = requests.get(
        "https://boxes.mysubscriptionaddiction.com/box/boxycharm?ratings=true#review-update-create")
    soup = BeautifulSoup(r.text, 'html.parser')
    num = int(
        soup.find("a", class_="show-more-reviews").text.split(" ")[3][1:-1])
    # integer division so the range() below receives an int
    if num % 3 == 0:
        return (num // 3) + 1
    else:
        return math.ceil(num / 3) + 2


def Main():
    num = PageNum()
    headers = {
        'X-Requested-With': 'XMLHttpRequest'
    }
    with requests.Session() as req:
        for item in range(1, num):
            print(f"Extracting Page# {item}")
            r = req.get(
                f"https://boxes.mysubscriptionaddiction.com/get_user_reviews?box_id=105&page={item}", headers=headers)
            soup = BeautifulSoup(r.text, 'html.parser')
            # the endpoint returns escaped HTML, hence the escaped quotes in the class name
            for com in soup.findAll("div", class_=r'\"comment-body\"'):
                print(com.text[5:com.text.find(r"\n", 3)])


Main()
Sample of the output:
Number of Pages 49
Extracting Page# 1
****************************************
I think Boxycharm overall is the best beauty subscription. However, I think it's
ridiculous that if you want to upgrade you have to pay the 25 for the first box and then add additional money to get the premium. Even though it's only one time,
that's insane. So about 80 bucks just to switch to Premium. And suppose U do that and then my Boxy Premium shows up at my door. I open it ....and absolutely hate
the majority if everything I have. Yeah I would be furious! Not worth taking a chance on. Boxy only shows up half the time with actual products or colors I use.
I love getting the monthly boxes, just wish they would have followed my preferences for colors!
I used to really get excited for my boxes. But not so much anymore. This months
Fenty box choices lack! I am not a clown
Extracting Page# 2
****************************************
Love it its awsome
Boxycharm has always been a favorite subscription box, I’ve had it off and on , love most of the goodies. I get frustrated when they don’t curate it to fit me and or customer service isn’t that helpful but overall a great box’!
I like BoxyCharm but to be honest I feel like some months they don’t even look at your beauty profile because I sometimes get things I clearly said I wasn’t interested in getting.
Extracting Page# 3
****************************************
The BEST sub box hands down.
I love all the boxy charm boxes everything is amazing all full size products and
the colors are outstanding
I absolutely love Boxycharm. I have received amazing high end products. My makeup cart is so full I have such a variety everyday. I love the new premium box and paired with Boxyluxe I recieve 15 products for $85 The products are worth anywhere from $500 to $700 total. I used to spend $400 a month buying products at Ulta. I would HIGHLY recommend this subscription.
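If you would rather collect the comments into a DataFrame and CSV, as in your TrustPilot script, here is a minimal sketch along the same lines (it reuses PageNum() and the endpoint from the answer above; the single-column layout and output filename are just assumptions):

import pandas as pd
import requests
from bs4 import BeautifulSoup

comments = []
headers = {'X-Requested-With': 'XMLHttpRequest'}
with requests.Session() as req:
    for item in range(1, PageNum()):  # PageNum() as defined in the answer above
        r = req.get(
            f"https://boxes.mysubscriptionaddiction.com/get_user_reviews?box_id=105&page={item}",
            headers=headers)
        soup = BeautifulSoup(r.text, 'html.parser')
        for com in soup.findAll("div", class_=r'\"comment-body\"'):
            comments.append(com.text[5:com.text.find(r"\n", 3)])

pd.DataFrame({'Body': comments}).to_csv('boxycharm_reviews.csv', index=False)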
I have also worked out the code for your website. It uses Selenium for the button clicks and scrolling; do let me know if you have any doubts. I still suggest you go through the article first:
# -*- coding: utf-8 -*-
"""
Created on Sun Mar 8 18:09:45 2020
@author: prakharJ
"""

from selenium import webdriver
import time
import pandas as pd

names_found = []
comments_found = []
ratings_found = []
dateElements_found = []

# Web extraction of web page boxes
print("scheduled to run boxes web scraper")
driver = webdriver.Chrome(executable_path='Your/path/to/chromedriver.exe')
webpage = 'https://boxes.mysubscriptionaddiction.com/box/boxycharm?ratings=true#review-update-create'
driver.get(webpage)

SCROLL_PAUSE_TIME = 6

# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight*0.80);")
    time.sleep(SCROLL_PAUSE_TIME)
    try:
        b = driver.find_element_by_class_name('show-more-reviews')
        b.click()
        time.sleep(SCROLL_PAUSE_TIME)
    except Exception:
        s = 'no button'

    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

names_list = driver.find_elements_by_class_name('name')
comment_list = driver.find_elements_by_class_name('comment-body')
rating_list = driver.find_elements_by_xpath("//meta[@itemprop='ratingValue']")
date_list = driver.find_elements_by_class_name('comment-date')

for names in names_list:
    names_found.append(names.text)

for bodies in comment_list:
    try:
        comments_found.append(bodies.text)
    except:
        comments_found.append('NA')

for ratings in rating_list:
    try:
        ratings_found.append(ratings.get_attribute("content"))
    except:
        ratings_found.append('NA')

for dateElements in date_list:
    dateElements_found.append(dateElements.text)

# Create your temporary dataframe of the first iteration, then append that into your "final" dataframe
temp_df = pd.DataFrame({'User Name': names_found, 'Body': comments_found, 'Rating': ratings_found, 'Published Date': dateElements_found})
#df = df.append(temp_df, sort=False).reset_index(drop=True)

print('extraction completed for the day and system goes into sleep mode')
driver.quit()
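If you also want to persist what this collects, a to_csv call at the end would do it (the filename here is arbitrary):

temp_df.to_csv('boxycharm_selenium_reviews.csv', index=False)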
I am trying to scrape Backcountry.com review section. The site uses a dynamic load more section, ie the url doesn't change when you want to load more reviews. I am using Selenium webdriver to interact with the button that loads more review and BeautifulSoup to scrape the reviews.
I was able to successfully interact with the load more button and load all the reviews available. I was also able to scrape the initial reviews that appear before you try the load more button.
IN SUMMARY: I can interact with the load more button, I can scrape the initial reviews available but I cannot scrape all the reviews that are available after I load all.
I have tried to change the html tags to see if that makes a difference. I have tried to increase the sleep time in case the scraper didn't have enough time to complete its job.
import time
import requests
import pandas as pd
from bs4 import BeautifulSoup

# URL and Request code for BeautifulSoup
url_filter_bc = 'https://www.backcountry.com/msr-miniworks-ex-ceramic-water-filter?skid=CAS0479-CE-ONSI&ti=U2VhcmNoIFJlc3VsdHM6bXNyOjE6MTE6bXNy'
res_filter_bc = requests.get(url_filter_bc, headers={'User-agent': 'notbot'})


# Function that scrapes the reviews
def scrape_bc(request, website):
    newlist = []
    soup = BeautifulSoup(request.content, 'lxml')
    newsoup = soup.find('div', {'id': 'the-wall'})
    reviews = newsoup.find('section', {'id': 'wall-content'})
    for row in reviews.find_all('section', {'class': 'upc-single user-content-review review'}):
        newdict = {}
        newdict['review'] = row.find('p', {'class': 'user-content__body description'}).text
        newdict['title'] = row.find('h3', {'class': 'user-content__title upc-title'}).text
        newdict['website'] = website
        newlist.append(newdict)
    df = pd.DataFrame(newlist)
    return df


# Function that uses Selenium and combines that with the scraper function to output a pandas DataFrame
def full_bc(url, website):
    driver = connect_to_page(url, headless=False)  # connect_to_page: helper (defined elsewhere) that returns a webdriver
    request = requests.get(url, headers={'User-agent': 'notbot'})
    time.sleep(5)
    full_df = pd.DataFrame()
    while True:
        try:
            loadMoreButton = driver.find_element_by_xpath("//a[@class='btn js-load-more-btn btn-secondary pdp-wall__load-more-btn']")
            time.sleep(2)
            loadMoreButton.click()
            time.sleep(2)
        except:
            print('Done Loading More')
            # full_json = driver.page_source
            temp_df = pd.DataFrame()
            temp_df = scrape_bc(request, website)
            full_df = pd.concat([full_df, temp_df], ignore_index=True)
            time.sleep(7)
            driver.quit()
            break
    return full_df
I expect a pandas DataFrame with 113 rows and three columns.
I am getting a DataFrame with 18 rows and three columns.
Ok, you clicked loadMoreButton and loaded more reviews, but you keep feeding scrape_bc the same request content that you downloaded once with requests, entirely separately from Selenium.
Replace the requests.get(...) content with driver.page_source, and make sure you grab driver.page_source inside the loop, right before the scrape_bc(...) call:
request = driver.page_source
temp_df = pd.DataFrame()
temp_df = scrape_bc(request, website)
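A sketch of how full_bc might look with that change, keeping your function names (note that scrape_bc would then need to parse the HTML string directly, e.g. BeautifulSoup(html, 'lxml'), since page_source is text rather than a requests response object):

def full_bc(url, website):
    driver = connect_to_page(url, headless=False)  # your existing helper
    time.sleep(5)
    full_df = pd.DataFrame()
    while True:
        try:
            loadMoreButton = driver.find_element_by_xpath("//a[@class='btn js-load-more-btn btn-secondary pdp-wall__load-more-btn']")
            time.sleep(2)
            loadMoreButton.click()
            time.sleep(2)
        except:
            print('Done Loading More')
            html = driver.page_source            # grab the fully loaded page from Selenium
            temp_df = scrape_bc(html, website)   # scrape_bc should build its soup from this string
            full_df = pd.concat([full_df, temp_df], ignore_index=True)
            driver.quit()
            break
    return full_df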
This is my first time trying to use python with selenium and bs4.
I'm trying to scrape data from this website
To begin, I select GE from the cantone dropdown menu, click the checkbox "Conffermo" and the button "Ricerca". Then I can see the data. I have to click each arrow to expand the data and scrape it for every person (this is a loop, isn't it?), and then do the same on the next page (by clicking "Affiggere le seguenti entrate" at the bottom of the page).
I'd like to use relative XPath for the data, since not all persons have all the data (I'd like to put an empty cell in Excel when data is missing).
This is my code so far:
import time
import urllib2
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

browser = webdriver.Firefox()
URL = 'http://www.asca.ch/Partners.aspx?lang=it'
browser.get(URL)
time.sleep(10)

page = urllib2.urlopen(URL)  # query the website and return the html to the variable 'page'
soup = BeautifulSoup(page, 'html.parser')

inputElementCantone = browser.find_element_by_xpath('//*[@id="ctl00_MainContent_ddl_cantons_Input"]').click()
browser.find_element_by_xpath('/html/body/form/div[1]/div/div/ul/li[9]').click()
browser.find_element_by_xpath("//INPUT[@id='MainContent__chkDisclaimer']").click()
browser.find_element_by_xpath("//INPUT[@id='MainContent_btn_submit']").click()
arrow = browser.find_element_by_class_name("footable-toggle")
I'm stuck after this. The data I'd like to scrape (in excel columns) are: Discipline(s) thérapeutique(s), Cognome, Cellulare and email.
Any help is appreciated.
# To find the table
table = soup.find('table', {'class': 'footable'})

# To get all rows in that table
rows = table.find_all('tr')


# A function to process each row
def processRow(row):
    # All cells with hidden data
    dataFields = row.find_all('td', {'style': True})
    output = {}
    # Fixed index numbers are not ideal but in this case will work
    output['Discipline'] = dataFields[0].text
    output['Cognome'] = dataFields[2].text
    output['Cellulare'] = dataFields[8].text
    output['email'] = dataFields[10].text
    return output


# Declaring a list to store all results
results = []

# Iterating over all the rows and storing the processed result in a list
for row in rows:
    results.append(processRow(row))

print(results)
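To tie this to your Selenium session and export to Excel, here is a minimal sketch reusing processRow from above (it assumes the browser object from your code and that the results table has finished rendering; rows that lack the hidden cells, such as the header row, are skipped):

import pandas as pd
from bs4 import BeautifulSoup

# Parse the page Selenium is currently showing, after the clicks above
soup = BeautifulSoup(browser.page_source, 'html.parser')
table = soup.find('table', {'class': 'footable'})

results = []
for row in table.find_all('tr'):
    dataFields = row.find_all('td', {'style': True})
    if len(dataFields) > 10:  # skip header/short rows that lack the hidden cells
        results.append(processRow(row))

# One row per person, one column per field
pd.DataFrame(results).to_excel('asca_partners.xlsx', index=False)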