I am trying to scrape the review section of Backcountry.com. The site loads more reviews dynamically, i.e. the URL doesn't change when you want to load more reviews. I am using the Selenium webdriver to click the button that loads more reviews and BeautifulSoup to scrape the reviews.
I was able to successfully interact with the load more button and load all the reviews available. I was also able to scrape the initial reviews that appear before you click the load more button.
IN SUMMARY: I can click the load more button and I can scrape the initial reviews, but I cannot scrape the full set of reviews that becomes available after I load them all.
I have tried changing the HTML tags to see if that makes a difference, and I have increased the sleep time in case the scraper didn't have enough time to finish its job.
# Imports, URL, and request code for BeautifulSoup
import time
import requests
import pandas as pd
from bs4 import BeautifulSoup

url_filter_bc = 'https://www.backcountry.com/msr-miniworks-ex-ceramic-water-filter?skid=CAS0479-CE-ONSI&ti=U2VhcmNoIFJlc3VsdHM6bXNyOjE6MTE6bXNy'
res_filter_bc = requests.get(url_filter_bc, headers={'User-agent': 'notbot'})
# Function that scrapes the reviews
def scrape_bc(request, website):
    newlist = []
    soup = BeautifulSoup(request.content, 'lxml')
    newsoup = soup.find('div', {'id': 'the-wall'})
    reviews = newsoup.find('section', {'id': 'wall-content'})
    for row in reviews.find_all('section', {'class': 'upc-single user-content-review review'}):
        newdict = {}
        newdict['review'] = row.find('p', {'class': 'user-content__body description'}).text
        newdict['title'] = row.find('h3', {'class': 'user-content__title upc-title'}).text
        newdict['website'] = website
        newlist.append(newdict)
    df = pd.DataFrame(newlist)
    return df
# Function that uses Selenium, combines it with the scraper function, and outputs a pandas DataFrame
def full_bc(url, website):
    driver = connect_to_page(url, headless=False)
    request = requests.get(url, headers={'User-agent': 'notbot'})
    time.sleep(5)
    full_df = pd.DataFrame()
    while True:
        try:
            loadMoreButton = driver.find_element_by_xpath("//a[@class='btn js-load-more-btn btn-secondary pdp-wall__load-more-btn']")
            time.sleep(2)
            loadMoreButton.click()
            time.sleep(2)
        except:
            print('Done Loading More')
            # full_json = driver.page_source
            temp_df = pd.DataFrame()
            temp_df = scrape_bc(request, website)
            full_df = pd.concat([full_df, temp_df], ignore_index=True)
            time.sleep(7)
            driver.quit()
            break
    return full_df
I expect a pandas DataFrame with 113 rows and three columns.
I am getting a pandas DataFrame with 18 rows and three columns.
OK, you clicked loadMoreButton and loaded more reviews, but you keep feeding scrape_bc the same request content that you downloaded once with requests, completely separately from Selenium.
Replace requests.get(...) with driver.page_source, and make sure you grab driver.page_source inside the loop, right before the scrape_bc(...) call:
request = driver.page_source
temp_df = pd.DataFrame()
temp_df = scrape_bc(request, website)
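To make that concrete, here is a minimal, untested sketch of how full_bc could look with that change (it assumes your connect_to_page helper returns an ordinary Selenium driver). Note that scrape_bc then receives a plain HTML string, so it has to parse it with BeautifulSoup(html, 'lxml') instead of BeautifulSoup(request.content, 'lxml'):

def full_bc(url, website):
    driver = connect_to_page(url, headless=False)
    time.sleep(5)
    full_df = pd.DataFrame()
    while True:
        try:
            loadMoreButton = driver.find_element_by_xpath("//a[@class='btn js-load-more-btn btn-secondary pdp-wall__load-more-btn']")
            time.sleep(2)
            loadMoreButton.click()
            time.sleep(2)
        except:
            print('Done Loading More')
            html = driver.page_source             # the fully loaded page, straight from Selenium
            temp_df = scrape_bc(html, website)    # scrape_bc must parse the string with BeautifulSoup(html, 'lxml')
            full_df = pd.concat([full_df, temp_df], ignore_index=True)
            driver.quit()
            break
    return full_df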
I am trying to scrape some ETF stock information from https://etfdb.com/etfs/sector/technology/#etfs&sort_name=assets_under_management&sort_order=desc&page=1 as a personal project.
What I am trying to do is scrape the table shown on each page, but it always returns the same values even though I update the page number in the URL. Is there some limitation of the webpage that I am not considering? What can I do to scrape the tables from pages 1 through 5 of the above link?
The code that I am trying to use is as follows:
import pandas as pd
import requests

def etf_table_scraper(industry):
    # instantiate empty dataframe
    df = pd.DataFrame()
    # cycle through the pages
    for page in range(1, 10):
        url = f"https://etfdb.com/etfs/sector/{industry}/#etfs__returns&sort_name=symbol&sort_order=asc&page={page}"
        r = requests.get(url)
        df_list = pd.read_html(r.text)[0]  # read_html parses all the tables on the page into a list; take the first one
        # if first page, append
        if page == 1:
            df = df.append(df_list.iloc[:-1])
        # otherwise check to see if there are overlaps
        elif df_list.loc[0, 'Symbol'] not in df['Symbol'].unique():
            df = df.append(df_list.iloc[:-1])
        else:
            break
    return df
I saw the same issue as you when using requests. Everything after the # in that URL is a fragment, which is handled client-side and never sent to the server, so requests gets the same first page back no matter which page number you put in it. I was able to work around this using Selenium and clicking the next page button. Here's some sample code; you'd need to rework it into your own flow, as this was just used for testing.
from selenium import webdriver
from time import sleep
import random

import pandas as pd

df = pd.DataFrame()
driver = webdriver.Chrome(executable_path=r"C:\chromedriver_win32\chromedriver.exe")  # Add your own path here
driver.get("https://etfdb.com/etfs/sector/technology/#etfs&sort_name=assets_under_management&sort_order=desc&page=1")
sleep(2)

text = driver.page_source  # Get page source to read the table
table_pg1 = pd.read_html(text)[0].iloc[:-1]
df = df.append(table_pg1)
sleep(2)

for i in range(1, 4):
    driver.find_element_by_xpath('//*[@id="featured-wrapper"]/div[1]/div[4]/div[1]/div[2]/div[2]/div[2]/div[4]/div[2]/ul/li[8]/a').click()  # Click next page button
    sleep(3)
    text = driver.page_source
    table_pg_i = pd.read_html(text)[0].iloc[:-1]
    df = df.append(table_pg_i)

driver.close()
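If it helps, here is a rough, untested sketch of the same idea folded into a function shaped like your etf_table_scraper (the chromedriver path and the next-page XPath are taken from the snippet above and are assumptions, not guarantees):

def etf_table_scraper_selenium(industry, pages=5,
                               driver_path=r"C:\chromedriver_win32\chromedriver.exe"):
    driver = webdriver.Chrome(executable_path=driver_path)
    driver.get(f"https://etfdb.com/etfs/sector/{industry}/"
               "#etfs&sort_name=assets_under_management&sort_order=desc&page=1")
    sleep(2)
    frames = [pd.read_html(driver.page_source)[0].iloc[:-1]]  # page 1
    for _ in range(pages - 1):
        # Click the same "next page" link as above; adjust the XPath if the layout differs
        driver.find_element_by_xpath(
            '//*[@id="featured-wrapper"]/div[1]/div[4]/div[1]/div[2]/div[2]'
            '/div[2]/div[4]/div[2]/ul/li[8]/a').click()
        sleep(3)
        frames.append(pd.read_html(driver.page_source)[0].iloc[:-1])
    driver.close()
    return pd.concat(frames, ignore_index=True)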
Hi guys,
I have a problem with scraping this dynamic site (https://kvartiry-bolgarii.ru/). I need to get all the links to the home sale ads.
I used Selenium to load the page and collect the links to the ads; after that I scroll the page down to load new ads. Once the new ads are loaded, I parse all the links on the page again and write them to the list.
But the data in the list is not updated, and the script keeps working with the links that were on the page before scrolling down.
By the way, I added a check so that the script keeps running until the last ad on the site, whose link I found out in advance, appears in the list.
How can this problem be corrected?
def get_link_info():
    try:
        url = "https://kvartiry-bolgarii.ru/"
        driver = webdriver.Chrome(
            executable_path=r'C:\Users\kk\Desktop\scrape_house\drivers\chromedriver.exe',
            options=options
        )
        driver.get(url)
        req = requests.get(url)
        req.encoding = 'utf8'
        soup = BeautifulSoup(req.text, "lxml")
        articles = soup.find_all("div", class_="content")
        links_urls = []
        for article in articles:
            house_url = article.find("a").get("href")
            links_urls.append(house_url)
        # print(links_urls)
        first_link_number = links_urls[-2].split("-")[-1]
        first_link_number = first_link_number[1:]
        # print(first_link_number)
        last_link_number = links_urls[-1].split("-")[-1]
        last_link_number = last_link_number[1:]
        # print(last_link_number)
        html = driver.find_element_by_tag_name('html')
        html.send_keys(Keys.END)
        check = "https://kvartiry-bolgarii.ru/kvartira-v-elitnom-komplekse-s-unikalynym-sadom-o21751"
        for a in links_urls:
            if a != check:
                for article in articles:
                    house_url = article.find("a").get("href")
                    links_urls.append(house_url)
                html = driver.find_element_by_tag_name('html')
                html.send_keys(Keys.END)
                print(links_urls[-1])
            else:
                print(links_urls[0], links_urls[-1])
                print("all links are ready")
Some pointers: you don't need to mix Selenium, requests, and BeautifulSoup; Selenium alone is enough. When you are scrolling infinitely, you need to remove duplicate elements before adding them to your list.
You can try this. This should work.
from selenium import webdriver
import time

def get_link_info():
    all_links = []
    try:
        driver = webdriver.Chrome(executable_path='C:/chromedriver.exe')
        driver.get('https://kvartiry-bolgarii.ru/')
        time.sleep(3)
        old_links = set()  # Empty set
        while True:
            # Scroll to get more ads
            driver.execute_script("window.scrollBy(0,3825)", "")
            # Wait for new ads to load
            time.sleep(8)
            links_divs = driver.find_elements_by_xpath('//div[@class="content"]//a')  # Find elements
            ans = set(links_divs) - set(old_links)  # Keep only the elements we have not seen yet
            if not ans:
                break  # Nothing new was loaded, so we have reached the end
            for link in ans:
                # Scroll to the link
                driver.execute_script("arguments[0].scrollIntoView();", link)
                fir = link.get_attribute('href')
                all_links.append(fir)
            # Remember the elements already processed (removes duplicates on the next pass)
            old_links = links_divs
        return all_links
    except Exception as e:
        raise e

get_link_info()
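As a side note, the fixed time.sleep(8) can be replaced with an explicit wait that polls until more ad links have appeared. A small, untested sketch of that idea, reusing the same XPath as above:

from selenium.webdriver.support.ui import WebDriverWait

def wait_for_more_links(driver, previous_count, timeout=15):
    # Poll until more '//div[@class="content"]//a' elements exist than before the scroll,
    # or give up after `timeout` seconds; either way, return the current elements.
    try:
        WebDriverWait(driver, timeout).until(
            lambda d: len(d.find_elements_by_xpath('//div[@class="content"]//a')) > previous_count
        )
    except Exception:
        pass
    return driver.find_elements_by_xpath('//div[@class="content"]//a')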
I am trying to scrape two tables on separate pages after accessing the site through a login. I have tried a few different ways and can't figure it out.
The last attempt showed some promise, but only the first DataFrame was appended to the list of DataFrames. The code looks something like the following:
from selenium import webdriver
import pandas as pd
import requests
import time
from bs4 import BeautifulSoup as BS

def text_to_chart(url, table):
    df_list = []
    driver = webdriver.Chrome(path)
    driver.get(login)
    driver.find_element_by_xpath(password_block).send_keys(password)
    driver.find_element_by_xpath(username_block).send_keys(username)
    driver.find_element_by_xpath(submit).click()
    time.sleep(10)
    df = pd.DataFrame()
    for url, table in zip(urls, tables):
        driver.get(url)
        time.sleep(10)
        soup = BS(driver.page_source, 'html')
        new_table = soup.find_all('table', attrs={'class': table})
        results_list = pd.read_html(str(new_table[0]))
        df = df.append(pd.DataFrame(results_list[0]))
    return df

def scrape(url, table):
    df_list = []
    df_list = df_list.append(text_to_chart(url, table))

scrape(url_list, table_list)
So, what should I do to scrape multiple pages?
I suggest you store the values in a list of dictionaries and then convert it to a DataFrame. That will be clean and easy.
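A minimal sketch of that pattern (the loop and field names here are placeholders for your scraping code, not anything from the site). One thing to watch out for in your scrape function: list.append returns None, so df_list = df_list.append(...) discards the list.

import pandas as pd

rows = []
for value_a, value_b in [('x', 1), ('y', 2)]:  # stand-in for your scraping loop
    rows.append({'col_a': value_a, 'col_b': value_b})  # one dictionary per scraped row

df = pd.DataFrame(rows)  # columns come from the dictionary keys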
Solved! I made a few changes, which resulted in one function that creates my list of DataFrames. Then I start the session, log in, and call the function, saving the output to my variable df_list.
from selenium import webdriver
import pandas as pd
import requests
import time
from bs4 import BeautifulSoup as BS

def text_to_chart(urls, tables):
    df = []
    for url, table in zip(urls, tables):
        driver.get(url)
        time.sleep(10)
        soup = BS(driver.page_source, 'html')
        new_table = soup.find_all('table', attrs={'class': table})
        results_list = pd.read_html(str(new_table[0]))
        df.append(pd.DataFrame(results_list[0]))
    return df

driver = webdriver.Chrome(path)
driver.get(login)
driver.find_element_by_xpath(password_block).send_keys(password)
driver.find_element_by_xpath(username_block).send_keys(username)
driver.find_element_by_xpath(submit).click()
time.sleep(10)

df_list = text_to_chart(url_list, table_list)
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'http://dciindia.gov.in/DentistsSearch.aspx?Reg_Type=D&RegUnder=0&IDRId=&IDRName=&CourseId=0&RegDate=0&CouncilId='
html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', {'id': 'gvSearchDentistlist'})
try:
    rows = table.find_all('tr')
    for row in rows:
        if len(row.find_all('td')) == 6:
            data = row.find_all('td')
            name = data[1].text.strip()
            print("NAME:" + name)
            root_url = data[5].input['onclick'].split(",")[4]
            link = 'http://dciindia.gov.in/' + root_url
            print("LINK:" + link)
except:
    pass
I wrote this code, but it gives output only for the first page. I want to run it for all the pages on the above site. What should I do? Please help.
The problem is that the page uses a JavaScript __doPostBack call for pagination, so the later pages can't be fetched just by changing the URL. Since no one has pointed out Selenium as an alternative, here is an example of clicking through the pages of your webpage using Selenium:
from bs4 import BeautifulSoup
from selenium import webdriver

def your_func(html):
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table', {'id': 'gvSearchDentistlist'})
    try:
        rows = table.find_all('tr')
        for row in rows:
            if len(row.find_all('td')) == 6:
                data = row.find_all('td')
                name = data[1].text.strip()
                print("NAME:" + name)
                root_url = data[5].input['onclick'].split(",")[4]
                link = 'http://dciindia.gov.in/' + root_url
                print("LINK:" + link)
    except:
        pass

url = 'http://dciindia.gov.in/DentistsSearch.aspx?Reg_Type=D&RegUnder=0&IDRId=&IDRName=&CourseId=0&RegDate=0&CouncilId='
driver = webdriver.Chrome(executable_path=r'path\chromedriver.exe')
driver.maximize_window()

# first page
driver.get(url)
html = driver.page_source
your_func(html)

# page 2
nextPage = driver.find_element_by_xpath('/html/body/form/div[3]/div/table/tbody/tr[5]/td/fieldset/div/table/tbody/tr[52]/td/table/tbody/tr/td[2]/a')
nextPage.click()
html = driver.page_source
your_func(html)

# page 3
nextPage = driver.find_element_by_xpath('//*[@id="gvSearchDentistlist"]/tbody/tr[52]/td/table/tbody/tr/td[3]/a')
nextPage.click()
html = driver.page_source
your_func(html)
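If you need every page rather than just the first three, the clicking can be generalized into a loop that, starting right after driver.get(url), scrapes the current page and then clicks the next page number until there isn't one. A rough, untested sketch, assuming the GridView pager renders plain numbered links (later pages may be hidden behind a "..." link, in which case the lookup below needs adjusting):

import time
from selenium.common.exceptions import NoSuchElementException

page = 1
while True:
    your_func(driver.page_source)  # scrape the page that is currently displayed
    page += 1
    try:
        # Look for the pager link whose visible text is the next page number
        next_link = driver.find_element_by_link_text(str(page))
    except NoSuchElementException:
        break  # no link for the next page, so we are done
    next_link.click()
    time.sleep(2)  # give the postback time to re-render the table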
I am trying to download the data on this website, https://coinmunity.co/, in order to manipulate it later in Python or pandas.
I have tried to read it directly into pandas via requests, but it did not work, using this code:
import requests
import pandas as pd
from bs4 import BeautifulSoup

res = requests.get("https://coinmunity.co/")
soup = BeautifulSoup(res.content, 'lxml')
table = soup.find_all('table')[0]
dfm = pd.read_html(str(table), header=0)
dfm = dfm[0].dropna(axis=0, thresh=4)
dfm.head()
In most of the things I tried, I could only get to the info in the headers, which seems to be the only table on this page that the code can see.
Seeing that this did not work, I tried to do the same scraping with requests and BeautifulSoup, but it did not work either. This is my code:
import requests
from bs4 import BeautifulSoup
res = requests.get("https://coinmunity.co/")
soup = BeautifulSoup(res.content, 'lxml')
#table = soup.find_all('table')[0]
#table = soup.find_all('div', {'class':'inner-container'})
#table = soup.find_all('tbody', {'class':'_ngcontent-c0'})
#table = soup.find_all('table')[0].findAll('tr')
#table = soup.find_all('table')[0].find('tbody')#.find_all('tbody _ngcontent-c3=""')
table = soup.find_all('p', {'class':'stats change positiveSubscribers'})
You can see in the commented lines all the things I have tried, but nothing worked.
Is there any way to download that table for use in pandas/Python in the tidiest, easiest, and quickest possible way?
Thank you
Since the content is loaded dynamically after the initial request is made, you won't be able to scrape this data with requests. Here's what I would do instead:
from selenium import webdriver
import pandas as pd
import time
from bs4 import BeautifulSoup

driver = webdriver.Firefox()
driver.implicitly_wait(10)
driver.get("https://coinmunity.co/")
html = driver.page_source.encode('utf-8')
soup = BeautifulSoup(html, 'lxml')

results = []
for row in soup.find_all('tr')[2:]:
    data = row.find_all('td')
    name = data[1].find('a').text
    value = data[2].find('p').text
    # get the rest of the data you need about each coin here,
    # then add it to the dictionary that you append to results
    results.append({'name': name, 'value': value})

df = pd.DataFrame(results)
df.head()
    name   value
0   NULS  14,005
1    VEN  84,486
2    EDO  20,052
3   CLUB   1,996
4    HSR   8,433
You will need to make sure that geckodriver is installed and that it is on your PATH. I only scraped the name and value of each coin, but getting the rest of the information should be easy.
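For instance, here is a small sketch of grabbing every cell in each row instead of just those two columns; the column names at the end are placeholders, since I don't know what the remaining columns of the table are called:

results = []
for row in soup.find_all('tr')[2:]:
    cells = [td.get_text(strip=True) for td in row.find_all('td')]
    if not cells:
        continue  # skip header/spacer rows that have no <td> cells
    results.append(cells)

df = pd.DataFrame(results)
# Name the columns yourself afterwards, e.g.
# df.columns = ['rank', 'name', 'subscribers', ...]  # placeholder names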