Fetch only specific links using Selenium in Python

I am trying to fetch the links of all news articles related to Apple, using this webpage: https://finance.yahoo.com/quote/AAPL/news?p=AAPL. But there are also a lot of advertisement links in between, and other links that lead to other pages of the site. How do I fetch only the links to news articles?
Here is the code I have written so far:
from selenium import webdriver
import requests
from bs4 import BeautifulSoup

driver = webdriver.Chrome(executable_path='C:\\Users\\Home\\OneDrive\\Desktop\\AJ\\chromedriver_win32\\chromedriver.exe')
driver.get("https://finance.yahoo.com/quote/AAPL/news?p=AAPL")
links = []
for a in driver.find_elements_by_xpath('.//a'):
    links.append(a.get_attribute('href'))

def get_info(url):
    # send request
    response = requests.get(url)
    # parse
    soup = BeautifulSoup(response.text)
    # get the information we need
    news = soup.find('div', attrs={'class': 'caas-body'}).text
    headline = soup.find('h1').text
    date = soup.find('time').text
    return news, headline, date
Can anyone guide on how to do this or to a resource that can help with this? Thanks!

Try this XPath to get all the news links from that page:
//li[contains(@class,'js-stream-content')]/div[@data-test-locator='mega']//h3/a
driver.implicitly_wait(10)
driver.maximize_window()
driver.get("https://finance.yahoo.com/quote/AAPL/news?p=AAPL")
time.sleep(10)
links = driver.find_elements_by_xpath("//li[contains(@class,'js-stream-content')]/div[@data-test-locator='mega']//h3/a")
for link in links:
    print(link.get_attribute("href"))
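If you then want to feed those links straight into the get_info function from the question, a minimal sketch (assuming the imports from the question, and that Yahoo serves the article pages without blocking) could look like this:
articles = []
for link in links:
    url = link.get_attribute("href")
    try:
        news, headline, date = get_info(url)
        articles.append({"url": url, "headline": headline, "date": date, "news": news})
    except AttributeError:
        # some links lead to video or ad pages without a 'caas-body' div; skip them
        continue
print(len(articles), "articles scraped")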

Related

Cannot scrape google patent URL through python and Beautiful Soup

I am currently trying to scrape a link to Google Patents from this page,
https://datatool.patentsview.org/#detail/patent/10745438, but when I print out all of the links with an 'a' tag, only unrelated links come up.
Here is my code so far:
import requests
from bs4 import BeautifulSoup

url = 'https://datatool.patentsview.org/#detail/patent/10745438'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
links = []
print(soup)
for link in soup.find_all('a', href=True):
    print(link['href'])
When I print out the soup, the 'a' tag with the link to the Google patent isn't printed, nor is the link in the list. The only things printed are
http://uspto.gov/
tel:1-800-786-9199
./#viz/relationships
./#viz/locations
./#viz/comparisons
all of which is irrelevant. Is Google protecting their links in some way, or is there any other way I can retrieve the link to the Google patent, or redirect to that page?
Don't scrape it, just do some link hacking. The part of the URL after the # is a client-side route, so the page content (including the Google Patents link) is rendered by JavaScript and never appears in the HTML that requests downloads. The patent number is already in the URL, so you can build the link directly:
url = 'https://datatool.patentsview.org/#detail/patent/10745438'
google_patents_url = 'https://www.google.com/patents/US' + url.rsplit('/', 1)[1]
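For the example URL this produces https://www.google.com/patents/US10745438.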

Accessing all elements from main website page with Beautiful Soup

I want to scrape news from this website:
https://www.bbc.com/news
You can see that the website has categories such as Home, US Election, Coronavirus, etc.
For example, if I go to a specific news article such as:
https://www.bbc.com/news/election-us-2020-54912611
I can write a scraper that will give me the headline. This is the code:
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}  # any browser-like User-Agent works here
response = requests.get("https://www.bbc.com/news/election-us-2020-54912611", headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
title = soup.select("header h1")
print(title)
There are hundreds of news articles on this website, so my question is: is there a way to access every news article on the website (all categories) starting from the home page URL? On the home page I can't see all the news articles, only some of them, so is there a way for me to load the HTML for the whole site, so that I can easily get all the news headlines with:
soup.select("header h1")
OK, after getting these headlines you will also find other links on each page; you just open those links in turn and fetch information from them as well. It can look like this:
visited = set()
links = [....]
while links:
    link_for_fetch = links.pop()
    if link_for_fetch in visited:
        continue
    content = get_contents(link_for_fetch)
    headlines += parse_headlines()
    links += parse_links()
    visited.add(link_for_fetch)
It's just pseudocode; you can write it in any programming language. But this can take a lot of time for parsing the whole site :( and the site's bot protection can block your IP address. A more concrete Python sketch of the same loop is below.
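A minimal concrete version of that loop (a sketch, assuming requests and BeautifulSoup, restricting the crawl to /news/ links so it stays on BBC article pages, and capping the number of pages visited):
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base = "https://www.bbc.com/news"
to_visit = [base]
visited = set()
headlines = []

while to_visit and len(visited) < 50:          # cap the crawl so it doesn't run forever
    url = to_visit.pop()
    if url in visited:
        continue
    visited.add(url)
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    for h1 in soup.select("header h1"):        # same selector as in the question
        headlines.append(h1.get_text(strip=True))
    for a in soup.find_all("a", href=True):    # queue further /news/ links
        link = urljoin(base, a["href"])
        if "/news/" in link and link not in visited:
            to_visit.append(link)

print(headlines)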

What would be the best way to scrape this website? (Not Selenium)

Before I begin: the TL;DR is at the bottom.
So I'm trying to scrape https://rarbgmirror.com/ for torrent magnet links and their torrent titles based on user-inputted searches. I've already figured out how to do this using BeautifulSoup and Requests with this code:
from bs4 import BeautifulSoup
import requests
import re

query = input("Input a search: ")
link = 'https://rarbgmirror.com/torrents.php?search=' + query
magnets = []
titles = []
try:
    request = requests.get(link)
except:
    print("ERROR")
source = request.text
soup = BeautifulSoup(source, 'lxml')
for page_link in soup.findAll('a', attrs={'href': re.compile("^/torrent/")}):
    page_link = 'https://www.1377x.to/' + page_link.get('href')
    try:
        page_request = requests.get(page_link)
    except:
        print("ERROR")
    page_source = page_request.content
    page_soup = BeautifulSoup(page_source, 'lxml')
    link = page_soup.find('a', attrs={'href': re.compile("^magnet")})
    magnets.append(link.get('href'))
    title = page_soup.find('h1')
    titles.append(title)
print(titles)
print(magnets)
I am almost certain that this code has no error in it because the code was originally made for https://1377x.to for the same purpose, and if you look through the HTML structure of both websites, they use the same tags for magnet links and title names. But if the code is faulty please point that out to me!
After some research I found the issue to be that https://rarbgmirror.com/ uses JavaScript to dynamically load its pages. After some more research I found that Selenium is recommended for this purpose. But after some time using Selenium I found some cons to using it, such as:
The slow speed of scraping
The system the app runs on must have the browser and its matching driver installed (I'm planning on using PyInstaller to package the app, which would make this an issue)
So I'm asking for an alternative to Selenium for scraping dynamically loaded web pages.
TLDR:
I want an alternative to selenium to scrape a website which is dynamically loaded using JavaScript.
PS: GitHub Repo:
https://github.com/eliasbenb/MagnetMagnet
If you are using only Chrome, you can check out Puppeteer by Google. It is fast and integrates quite well with Chrome DevTools.
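Puppeteer itself is a Node.js library; if you want to stay in Python, the unofficial port pyppeteer exposes a very similar API. A minimal sketch (assuming pyppeteer is installed and can download its bundled Chromium) of fetching the rendered HTML:
import asyncio
from pyppeteer import launch
from bs4 import BeautifulSoup

async def fetch_rendered(url):
    browser = await launch(headless=True)
    page = await browser.newPage()
    await page.goto(url)
    html = await page.content()   # HTML after JavaScript has run
    await browser.close()
    return html

html = asyncio.get_event_loop().run_until_complete(
    fetch_rendered('https://rarbgmirror.com/torrents.php?search=test'))
soup = BeautifulSoup(html, 'lxml')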
WORKING SOLUTION
DISCLAIMER FOR PEOPLE LOOKING FOR AN ANSWER: this method WILL NOT work for any website other than RARBG
I posted this same question to Reddit's r/learnpython and someone there found a great answer which met all my requirements. You can find the original comment here
What he found out was that RARBG gets its info from here
You can change what is searched by changing "QUERY" in the link. That page contains all the information for each torrent, so using requests and bs4 I scraped all the information.
Here is the working code:
import requests
from bs4 import BeautifulSoup

magnets = []

query = input("Input a search: ")
rarbg_link = 'https://torrentapi.org/pubapi_v2.php?mode=search&search_string=' + query + '&token=lnjzy73ucv&format=json_extended&app_id=lol'
try:
    request = requests.get(rarbg_link, headers={'User-Agent': 'Mozilla/5.0'})
except:
    print("ERROR")
source = request.text
soup = str(BeautifulSoup(source, 'lxml'))
soup = soup.replace('<html><body><p>{"torrent_results":[', '')
soup = soup.split(',')
titles = str([i for i in soup if i.startswith('{"title":')])
titles = titles.replace('{"title":"', '')
titles = titles.replace('"', '')
titles = titles.split("', '")   # titles is now the list of torrent names
links = str([i for i in soup if i.startswith('"download":')])
links = links.replace('"download":"', '')
links = links.replace('"', '')
links = links.split("', '")
for link in links:
    magnets.append(link)
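Since that endpoint returns JSON, the string splitting above can be avoided by parsing the response directly. A minimal sketch, assuming the token in the original link is still valid (torrentapi tokens expire, so you may need to request a fresh one):
import requests

query = input("Input a search: ")
rarbg_link = ('https://torrentapi.org/pubapi_v2.php?mode=search&search_string=' + query
              + '&token=lnjzy73ucv&format=json_extended&app_id=lol')
data = requests.get(rarbg_link, headers={'User-Agent': 'Mozilla/5.0'}).json()

# 'title' and 'download' are the same fields the string-splitting version extracted
titles = [t['title'] for t in data.get('torrent_results', [])]
magnets = [t['download'] for t in data.get('torrent_results', [])]
print(titles)
print(magnets)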

Extract urls from the footer of a web page

I am trying to extract the social media links from websites for my research. Unfortunately, I am not able to extract them, as they are located in the footer of the website.
I tried the requests, urllib.request, and pattern.web APIs to download the HTML document of a webpage. All these APIs download the same content, and all fail to download the content in the footer of the websites.
import requests
from bs4 import BeautifulSoup as soup
url = 'https://cloudsight.ai/'
headers = {'User-Agent':'Mozilla/5.0'}
sm_sites = ['https://www.twitter.com','https://www.facebook.com',
'https://www.youtube.com','https://www.linkedin.com',
'https://www.linkedin.com/company', 'https://twitter.com',
'https://facebook.com','https://youtube.com','https://linkedin.com',
'http://www.twitter.com','http://www.facebook.com',
'http://www.youtube.com','http://www.linkedin.com',
'http://www.linkedin.com/company', 'http://twitter.com',
'http://facebook.com','http://youtube.com','http://linkedin.com']
blocked = ['embed','search','sharer','intent','share','watch']
sm_sites_present = []
r = requests.get(url,headers=headers)
content = soup(r.content,'html.parser')
text = r.text
links = content.find_all('a',href=True)
for link in links:
    a = link.attrs['href'].strip('/')
    try:
        if any(site in a for site in sm_sites) and not any(block in a for block in blocked):
            sm_sites_present.append(a)
    except:
        sm_sites_present.append(None)
output:
>>> sm_sites_present
>>> []
If you inspect the website with the browser's developer tools, you can see the social media links in the footer div of the DOM.
But even text.find('footer') returns -1.
I tried for many hours to figure out how to extract this footer information and I failed.
SO, I kindly request if anyone could help me in solving it.
Note:
I even tried regex; the problem is that when we download the page, the footer content is simply not part of the downloaded HTML.
As suggested by @chitown88, you can use Selenium to get the content.
from selenium import webdriver
from bs4 import BeautifulSoup

url = 'https://cloudsight.ai/'
driver = webdriver.Firefox()
driver.get(url)
html = driver.page_source
driver.quit()
soup = BeautifulSoup(html, 'html.parser')
[i.a['href'] for i in soup.footer.find_all('li', {'class':'social-list__item'})]
output
['https://www.linkedin.com/company/cloudsight-inc',
'https://www.facebook.com/CloudSight',
'https://twitter.com/CloudSightAPI']

Scrape embedded tweets from a webpage with Selenium and BeautifulSoup

I need to extract tweets embedded in text articles. The problem with the pages I'm testing is that they only load the tweets in ~5 out of 10 runs. So I need to use Selenium to wait for the page to load, but I cannot make it work. I followed the steps from the official website:
from selenium import webdriver
from bs4 import BeautifulSoup

url = 'https://www.bbc.co.uk/news/world-us-canada-44648563'
options = webdriver.ChromeOptions()
options.add_argument("headless")
driver = webdriver.Chrome(executable_path='/Users/ME/Downloads/chromedriver', chrome_options=options)
driver.implicitly_wait(15)
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html, "lxml")
tweets_soup = [s.get_text() for s in soup.find_all('p', {'dir': 'ltr'})]
tweets = '\n'.join(tweets_soup)
print(tweets)
I cannot use the option of waiting for a certain element to appear, because I'm scanning different pages and not all of them have embedded tweets. So to check whether Selenium actually works, I run the above script together with the script that doesn't use Selenium and compare their results:
import requests
from bs4 import BeautifulSoup

url = 'https://www.bbc.co.uk/news/world-us-canada-44648563'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
tweets_soup = [s.get_text() for s in soup.find_all('p', {'dir': 'ltr'})]
tweets = '\n'.join(tweets_soup)
print(tweets)
I will really appreciate the help of this wonderful community!
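For reference, the "wait for a certain element" option can still be used here by wrapping it in a timeout, so pages without embedded tweets simply fall through instead of failing. A minimal sketch, reusing the driver and url from above and assuming the tweet paragraphs match the same p[dir='ltr'] selector:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

driver.get(url)
try:
    # wait up to 15 seconds for at least one embedded-tweet paragraph to appear
    WebDriverWait(driver, 15).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "p[dir='ltr']")))
except TimeoutException:
    pass  # no tweets on this page, continue with whatever has loaded
html = driver.page_source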
