Scrape embedded tweets from a webpage with Selenium and BeautifulSoup - python

I need to extract tweets embedded in text articles. The problem is that the pages I'm testing load the tweets in only about 5 out of 10 runs, so I want to use Selenium to wait for the page to load, but I cannot make it work. I followed the steps from the official website:
from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://www.bbc.co.uk/news/world-us-canada-44648563'
options = webdriver.ChromeOptions()
options.add_argument("headless")
driver = webdriver.Chrome(executable_path='/Users/ME/Downloads/chromedriver', chrome_options=options)
driver.implicitly_wait(15)  # note: implicit waits apply to element lookups, not to page_source
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html, "lxml")
tweets_soup = [s.get_text() for s in soup.find_all('p', {'dir': 'ltr'})]
tweets = '\n'.join(tweets_soup)
print(tweets)
I cannot simply wait for a specific element to appear, because I'm scanning many different pages and not all of them contain embedded tweets (though see my conditional-wait sketch below). So, to check whether Selenium actually makes a difference, I run the script above together with a version that doesn't use Selenium and compare their results:
import requests
from bs4 import BeautifulSoup

url = 'https://www.bbc.co.uk/news/world-us-canada-44648563'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
tweets_soup = [s.get_text() for s in soup.find_all('p', {'dir': 'ltr'})]
tweets = '\n'.join(tweets_soup)
print(tweets)
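One workaround I've sketched (untested) is to wait explicitly for the tweet markup but treat a timeout as "this page has no tweets", continuing the Selenium script above:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

driver.get(url)
try:
    # wait up to 15 seconds for at least one tweet paragraph to render
    WebDriverWait(driver, 15).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "p[dir='ltr']"))
    )
except TimeoutException:
    pass  # no embedded tweets on this page; scrape whatever did load
soup = BeautifulSoup(driver.page_source, "lxml")
Would something along these lines be the right approach?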
I would really appreciate the help of this wonderful community!

Related

Looping through pages of search result

I am trying to scrape Reuters image captions for certain pictures. I have searched with my parameters and have a search result with 182 pages. The 'PN=X' part at the end of each link is the page number. I have built a for loop to loop through the pages and scrape all captions:
import re
import requests
from bs4 import BeautifulSoup

pages = ['https://pictures.reuters.com/CS.aspx?VP3=SearchResult&VBID=2C0BXZS52QWLHI&SMLS=1&RW=1920&RH=688#/SearchResult&VBID=2C0BXZS52QWLHI&SMLS=1&RW=1920&RH=688&PN=1',
         'https://pictures.reuters.com/CS.aspx?VP3=SearchResult&VBID=2C0BXZS52QWLHI&SMLS=1&RW=1920&RH=688#/SearchResult&VBID=2C0BXZS52QWLHI&SMLS=1&RW=1920&RH=688&PN=2',
         'https://pictures.reuters.com/CS.aspx?VP3=SearchResult&VBID=2C0BXZS52QWLHI&SMLS=1&RW=1920&RH=688#/SearchResult&VBID=2C0BXZS52QWLHI&SMLS=1&RW=1920&RH=688&PN=3',
         'https://pictures.reuters.com/CS.aspx?VP3=SearchResult&VBID=2C0BXZS52QWLHI&SMLS=1&RW=1920&RH=688#/SearchResult&VBID=2C0BXZS52QWLHI&SMLS=1&RW=1920&RH=688&PN=4', ...]

complete_captions = []
for link in pages:
    page = requests.get(link)
    soup = BeautifulSoup(page.content, 'html.parser')
    for element in soup.find_all(id=re.compile("CaptionLong_Lbl")):
        if not element.text.endswith('...'):  # skip captions that are cut off
            complete_captions.append(element.text)
The code runs, but it returns the same captions regardless of the page it is given: it just repeats the same 47 results over and over again. When I open the pages in my browser, however, they are different from each other, so the loop should be giving different results. Any idea how to fix this?
For this website, getting different results for each page is more complicated than just adding a page number to the URL and calling requests.get(). Notice that the page number sits after the # in the URL: the fragment is never sent to the server, so every request fetches the same document, and it is the page's JavaScript that reads the fragment and loads the matching results.
A simpler approach in this case would be to use selenium, which runs that JavaScript, for example:
from bs4 import BeautifulSoup
import re
import time
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.headless = True
browser = webdriver.Firefox(options=options)

complete_captions = []
for page_number in range(1, 5):
    print(f"Page {page_number}")
    url = f'https://pictures.reuters.com/CS.aspx?VP3=SearchResult&VBID=2C0BXZS52QWLHI&SMLS=1&RW=1920&RH=688#/SearchResult&VBID=2C0BXZS52QWLHI&SMLS=1&RW=1920&RH=688&PN={page_number}'
    browser.get(url)
    time.sleep(1)  # give the page's JavaScript a moment to render the results
    soup = BeautifulSoup(browser.page_source, 'html.parser')
    for element in soup.find_all(id=re.compile("CaptionLong_Lbl")):
        if not element.text.endswith('...'):
            complete_captions.append(element.text)
            # print(element.text)
browser.quit()
Obviously, a different browser can be used.
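If the fixed time.sleep(1) ever proves flaky, the browser.get(url) / time.sleep(1) pair could be swapped for an explicit wait. A sketch, assuming the caption elements keep IDs containing CaptionLong_Lbl:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser.get(url)
# wait up to 10 seconds for at least one caption element instead of sleeping blindly
WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "[id*='CaptionLong_Lbl']"))
)
soup = BeautifulSoup(browser.page_source, 'html.parser')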

BeautifulSoup find() returns "None" with any name/attributes

I'm trying to get some information about a product I'm interested in on Amazon.
I'm using the BeautifulSoup library for web scraping:
import requests
from bs4 import BeautifulSoup

URL = 'https://www.amazon.it/gp/offer-listing/B08KHL2J5X/ref=dp_olp_unknown_mbc'
page = requests.get(URL, headers=headers)  # headers is defined earlier in my script
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find('span', class_='a-size-large a-color-price olpOfferPrice a-text-bold')
print(title)
In the picture, the highlighted row is the one I want to select, but when I run my script I get 'None' every time. (Printing the entire output after the BeautifulSoup call gives me the entire HTML source, so I'm using the right URL.)
Any solutions?
You need to use .text to get the text of an element.
so change:
print(title)
to:
print(title.text)
Output:
EUR 1.153,00
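Note that .text only helps once find() actually matches something. When it returns None (which is what the question reports), title.text raises AttributeError, so it is worth guarding for that case:
title = soup.find('span', class_='a-size-large a-color-price olpOfferPrice a-text-bold')
if title is not None:
    print(title.text.strip())
else:
    # selector matched nothing: the page may be JS-rendered or blocking the request
    print('price element not found')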
I wouldn't use BS alone in this case. You can easily add Selenium to scrape the website:
from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://www.amazon.it/gp/offer-listing/B08KHL2J5X/ref=dp_olp_unknown_mbc'
driver = webdriver.Safari()
driver.get(url)
html_content = driver.page_source
soup = BeautifulSoup(html_content, "html.parser")
title = soup.find('span', class_='a-size-large a-color-price olpOfferPrice a-text-bold')
print(title)
If you can't use Safari, you have to download the webdriver for Chrome, Firefox, etc., but there is plenty of reading material on this topic.
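For instance, a minimal headless-Firefox variant (a sketch, assuming geckodriver is installed and on your PATH):
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.options import Options as FirefoxOptions

url = 'https://www.amazon.it/gp/offer-listing/B08KHL2J5X/ref=dp_olp_unknown_mbc'
options = FirefoxOptions()
options.headless = True  # run without opening a browser window
driver = webdriver.Firefox(options=options)
driver.get(url)
soup = BeautifulSoup(driver.page_source, "html.parser")
title = soup.find('span', class_='a-size-large a-color-price olpOfferPrice a-text-bold')
print(title.text if title else 'not found')
driver.quit()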

scrape multiple pages with static url

I've asked a similar question about navigating multiple pages with a static URL from https://ethnicelebs.com/all-celeb (thanks for the help!). But now I'd like to scrape the ethnicity information of every celebrity listed, by clicking each name. I can navigate through all the pages now, but my code keeps scraping information from the very first page.
I've tried the following:
import time
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = 'https://ethnicelebs.com/all-celeb'
driver = webdriver.Chrome()
driver.get(url)
while True:
    page = requests.post('https://ethnicelebs.com/all-celebs')
    soup = BeautifulSoup(page.text, 'html.parser')
    for href in soup.find_all('a', href=True)[18:]:
        print('Found the URL:{}'.format(href['href']))
        request_href = requests.get(href['href'])
        soup2 = BeautifulSoup(request_href.content, 'html.parser')
        for each in soup2.find_all('strong')[:-1]:
            print(each.text)
    Next_button = (By.XPATH, "//*[@title='Go to next page']")
    WebDriverWait(driver, 50).until(EC.element_to_be_clickable(Next_button)).click()
    url = driver.current_url
    time.sleep(5)
(Thanks to @Sureshmani!)
I expect the code to scrape each page while navigating instead of only the first page. How can I scrape the current page while it keeps navigating? Thanks!
I misunderstood your question due to the nested loop in the previous answer. The following code would work:
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = 'https://ethnicelebs.com/all-celeb'
driver = webdriver.Chrome()
while True:
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    for href in soup.find_all('a', href=True)[18:]:
        print('Found the URL:{}'.format(href['href']))
        driver.get(href['href'])
        soup2 = BeautifulSoup(driver.page_source, 'html.parser')
        for each in soup2.find_all('strong')[:-1]:
            print(each.text)
    Next_button = (By.XPATH, "//*[@title='Go to next page']")
    WebDriverWait(driver, 50).until(EC.element_to_be_clickable(Next_button)).click()
    url = driver.current_url
    time.sleep(5)
In your code, you send a request through Selenium only once at the beginning, and then use requests afterwards, so requests keeps fetching the same first page no matter where the browser has navigated. To navigate and scrape a page at the same time, you should use only Selenium, as in the example above.
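One loose end: the while True loop above never terminates. A sketch of how the end of the loop could stop cleanly, assuming the 'Go to next page' button is absent on the last page:
from selenium.common.exceptions import TimeoutException

while True:
    # ... scrape the current listing page exactly as above, then:
    try:
        WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.XPATH, "//*[@title='Go to next page']"))
        ).click()
    except TimeoutException:
        break  # no Next button within 10 seconds: assume this was the last page
    url = driver.current_url
    time.sleep(5)
driver.quit()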

Selenium: URL content differs between Selenium result and browser

I'm trying to parse this URL.
First, I tried requests with bs4, but the resulting page differed from the content shown in the browser:
import requests
from bs4 import BeautifulSoup

cont = requests.get(path).content  # path is the URL in question
soup = BeautifulSoup(cont, "html.parser")
print(soup.prettify())
Next, I tried Selenium:
import time
from bs4 import BeautifulSoup
from selenium import webdriver

def render_page(path):
    driver = webdriver.PhantomJS()
    driver.get(path)
    time.sleep(3)
    r = driver.page_source
    return r

r = render_page(path)
soup = BeautifulSoup(r, "html.parser")
print(soup.prettify())
But it returns yet another version of the content. [Screenshot: content of page]
After that, I tried adding this to my code:
js_code = "return document.getElementsByTagName('html')[0].innerHTML"
your_elements = driver.execute_script(js_code)
but it didn't help.
So, is there any way to get the content of the page with requests, Selenium, or perhaps some other tool, exactly as it appears in the browser?
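In case it helps: since PhantomJS is no longer maintained, I'm also open to headless Chrome. A sketch of the equivalent render_page I could try, assuming chromedriver is installed:
import time
from selenium import webdriver

def render_page(path):
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    driver = webdriver.Chrome(options=options)  # assumes chromedriver is on PATH
    try:
        driver.get(path)
        time.sleep(3)  # crude wait for the JS-rendered content
        return driver.page_source
    finally:
        driver.quit()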

Beautifulsoup not returning complete HTML of the page

I have been digging on this site for some time and I'm unable to find a solution to my issue. I'm fairly new to web scraping and trying to simply extract some links from a web page using Beautiful Soup.
url = "https://www.sofascore.com/pt/futebol/2018-09-18"
page = urlopen(url).read()
soup = BeautifulSoup(page, "lxml")
print(soup)
At the most basic level, all I'm trying to do is access a specific tag within the website. I can work out the rest for myself; the part I'm struggling with is that a tag I am looking for is not in the output.
For example, using the built-in find() I can grab the following div class:
class="l__grid js-page-layout"
However, what I'm actually looking for is the content of a tag embedded at a lower level of the tree:
js-event-list-tournament-events
When I perform the same find operation on the lower-level tag, I get no results.
Using an Azure-based Jupyter Notebook, I have tried a number of solutions to similar problems on Stack Overflow, with no luck.
Thanks!
Kenny
The page uses JavaScript to load the data dynamically, so you have to use Selenium. Check the code below.
Note that you have to install Selenium and chromedriver (unzip the file and copy it into your Python folder).
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

url = "https://www.sofascore.com/pt/futebol/2018-09-18"
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
driver = webdriver.Chrome(chrome_options=options)
driver.get(url)
time.sleep(3)  # let the page's JavaScript render the event list
page = driver.page_source
driver.quit()

soup = BeautifulSoup(page, 'html.parser')
container = soup.find_all('div', attrs={'class': 'js-event-list-tournament-events'})
print(container)
Or you can use their JSON API:
import requests
url = 'https://www.sofascore.com/football//2018-09-18/json'
r = requests.get(url)
print(r.json())
I had the same problem and the following code worked for me. Chromedriver must be installed!
import time
from bs4 import BeautifulSoup
from selenium import webdriver

chromedriver_path = "/Users/.../chromedriver"  # truncated; point this at your chromedriver
driver = webdriver.Chrome(chromedriver_path)
url = "https://yourURL.com"
driver.get(url)
time.sleep(3)  # if you want to wait 3 seconds for the page to load
page_source = driver.page_source
soup = BeautifulSoup(page_source, 'lxml')
You can then use this soup as usual.
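For instance, a quick sketch of pulling every link out of the rendered page (the selector is just an example):
# collect all hyperlinks from the JavaScript-rendered page
links = [a['href'] for a in soup.find_all('a', href=True)]
print(len(links), 'links found')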
