Beautiful Soup not working on this website - python

I want to scrape the URLs of all the items in the table but when I try, nothing comes up. The code is quite basic so I can see why it might not work. However, even trying to scrape the title of this website, nothing comes up. I at least expected the h1 tag as it's outside the table...
Website: https://www.vanguard.com.au/personal/products/en/overview
import requests
from bs4 import BeautifulSoup

lists = []
url = 'https://www.vanguard.com.au/personal/products/en/overview'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')

title = soup.find_all('h1', class_='heading2 gbs-font-vanguard-red')
for links in soup.find_all('a', style='padding-bottom: 1px;'):
    link_text = links['href']
    lists.append(link_text)

print(title)
print(lists)

If the problem is caused by JavaScript adding the elements, I would suggest using BeautifulSoup along with Selenium to scrape this website. So, let Selenium send the request and get back the page source, then use BeautifulSoup to parse it.
In addition, you should use title = soup.find() instead of title = soup.find_all() in order to get only one title.
The example of code using Firefox:
from selenium import webdriver
from webdriver_manager.firefox import GeckoDriverManager
from bs4 import BeautifulSoup

url = 'https://www.vanguard.com.au/personal/products/en/overview'
browser = webdriver.Firefox(executable_path=GeckoDriverManager().install())
browser.get(url)
soup = BeautifulSoup(browser.page_source, 'html.parser')
browser.close()

lists = []
title = soup.find('h1', class_='heading2 gbs-font-vanguard-red')
for links in soup.find_all('a', style='padding-bottom: 1px;'):
    link_text = links['href']
    lists.append(link_text)

print(title)
print(lists)
Output:
<h1 class="heading2 gbs-font-vanguard-red">Investment products</h1>
['/personal/products/en/detail/8132', '/personal/products/en/detail/8219', '/personal/products/en/detail/8121',...,'/personal/products/en/detail/8217']
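If you'd rather not have a Firefox window pop up while this runs, the browser can be started headless; a minimal sketch, assuming the same Selenium 3-style API as the code above:

from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from webdriver_manager.firefox import GeckoDriverManager

options = Options()
options.headless = True  # run Firefox without opening a visible window
browser = webdriver.Firefox(executable_path=GeckoDriverManager().install(), options=options)
# ...then browser.get(url) and the BeautifulSoup parsing proceed exactly as above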

The most common problem (with many modern pages): this page uses JavaScript to add elements, but requests/BeautifulSoup can't run JavaScript.
You may need to use Selenium to control a real web browser, which can run JavaScript.
This example uses only Selenium, without BeautifulSoup.
I use XPath, but you may also use CSS selectors.
from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'https://www.vanguard.com.au/personal/products/en/overview'
lists = []

#driver = webdriver.Chrome(executable_path="/path/to/chromedriver.exe")
driver = webdriver.Firefox(executable_path="/path/to/geckodriver.exe")
driver.get(url)

title = driver.find_element(By.XPATH, '//h1[@class="heading2 gbs-font-vanguard-red"]')
print(title.text)

all_items = driver.find_elements(By.XPATH, '//a[@style="padding-bottom: 1px;"]')
for links in all_items:
    link_text = links.get_attribute('href')
    print(link_text)
    lists.append(link_text)
You will also need the matching driver executable: ChromeDriver (for Chrome) or GeckoDriver (for Firefox).
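As a side note, the webdriver_manager package used in the first answer can fetch the matching driver for you, so there is no path to hard-code; a minimal sketch for Chrome, assuming webdriver-manager is installed (pip install webdriver-manager):

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

# downloads a chromedriver matching the installed Chrome and returns its path
driver = webdriver.Chrome(executable_path=ChromeDriverManager().install())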

It's always more efficient to get the data from the source than through Selenium. It looks like the links are created from the portId.
import pandas as pd
import requests

url = 'https://www3.vanguard.com.au/personal/products/funds.json'
payload = {
    'context': '/personal/products/',
    'countryCode': 'au.ret',
    'paths': "[[['funds','legacyFunds'],'AU']]",
    'method': 'get'}

jsonData = requests.get(url, params=payload).json()
results = jsonData['jsonGraph']['funds']['AU']['value']

df1 = pd.json_normalize(results, record_path=['children'])
df2 = pd.json_normalize(results, record_path=['listings'])
df = pd.concat([df1, df2], axis=0)
df['url_link'] = 'https://www.vanguard.com.au/personal/products/en/detail/' + df['portId'] + '/Overview'
print(df[['fundName', 'url_link']])
Output:
fundName url_link
0 Vanguard Active Emerging Market Equity Fund https://www.vanguard.com.au/personal/products/...
1 Vanguard Active Global Credit Bond Fund https://www.vanguard.com.au/personal/products/...
2 Vanguard Active Global Growth Fund https://www.vanguard.com.au/personal/products/...
3 Vanguard Australian Corporate Fixed Interest I... https://www.vanguard.com.au/personal/products/...
4 Vanguard Australian Fixed Interest Index Fund https://www.vanguard.com.au/personal/products/...
.. ... ...
23 Vanguard MSCI Australian Small Companies Index... https://www.vanguard.com.au/personal/products/...
24 Vanguard MSCI Index International Shares (Hedg... https://www.vanguard.com.au/personal/products/...
25 Vanguard MSCI Index International Shares ETF https://www.vanguard.com.au/personal/products/...
26 Vanguard MSCI International Small Companies In... https://www.vanguard.com.au/personal/products/...
27 Vanguard International Credit Securities Hedge... https://www.vanguard.com.au/personal/products/...
[66 rows x 2 columns]
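If you want to keep the result, the DataFrame can be written straight to disk; a one-line sketch (the filename funds.csv is just an example):

df.to_csv('funds.csv', index=False)  # hypothetical filename; saves fundName, url_link and the other columns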

Related

How can I input/search text in JavaScript-based websites using Selenium with Python?

I want to input an address in the search field and get the Eircode/zip code/postal code from the output page, e.g.: https://eircode-finder.com/search/
I want to search a list of addresses like: 8 old bawn court tallaght dublin
and from the results fetch the Eircode/zip code/postal code and save it in a .txt file.
I have used BeautifulSoup to fetch data, but it's not fetching even the HTML of the page. I don't know the details, but something on the website, like JavaScript, is preventing me from getting data from it.
You can use the next example as a way to make a request to this page's API:
import requests
import pandas as pd

url = "https://geocode.search.hereapi.com/v1/geocode"

to_search = [
    "Coolboy Wicklow",
    "8 old bawn court tallaght dublin",
]

headers = {"Referer": "https://eircode-finder.com/"}
params = {
    "q": "",
    "lang": "en",
    "in": "countryCode:IRL",
    "apiKey": "BegLfP-EDdyWflI0fRrP3HJ7IDSK_0878_n2fbct1wE",
}

def get_item(q):
    params["q"] = q
    data = requests.get(url, params=params, headers=headers).json()
    out = []
    for i in data["items"]:
        out.append([i["title"], i["address"].get("postalCode")])
    return out

all_data = []
for q in to_search:
    all_data += get_item(q)

df = pd.DataFrame(all_data, columns=["title", "postal_code"])
df = df.drop_duplicates()
print(df.to_markdown(index=False))
Prints:

| title                                                      | postal_code |
|:-----------------------------------------------------------|:------------|
| Coolboy, Arklow, County Wicklow, Ireland                   |             |
| 8 Old Bawn Court, Dublin, County Dublin, D24 N1YH, Ireland | D24 N1YH    |
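Since the goal was to save the results to a .txt file, the DataFrame above can be dumped directly; a minimal sketch (the filename eircodes.txt is just an example):

# write the deduplicated results to a tab-separated text file
df.to_csv("eircodes.txt", sep="\t", index=False)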
The mentioned website is developed using React, which requires a JavaScript engine for rendering the HTML pages.
requests just sends the request and takes back the response; if it's a normal website the response will be HTML, or else it will be data that requires a JavaScript engine to render, and BeautifulSoup cannot run JavaScript.
Websites which require a JavaScript engine can be scraped with Selenium, as it uses an actual browser to request and load the page.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time

path = r"./chromedriver.exe"
driver = webdriver.Chrome(path)

url = "https://eircode-finder.com/search/"
driver.get(url)

search_input = driver.find_element_by_id("outlined")
search_input.send_keys("8 old bawn court tallaght dublin")  # the text which you want to search
search_input.send_keys(Keys.ENTER)

time.sleep(10)  # for the page to load

eircode = driver.find_element_by_css_selector("#root > div:nth-child(2) > div > div.MuiBox-root.jss12 > div > div > div > div:nth-child(1) > div.MuiBox-root.jss13 > div > div > h3 > div")
print(eircode.text)

time.sleep(10)  # buffer

# you can pass this page source to BeautifulSoup and scrape it,
# or you can continue scraping with Selenium
soup = BeautifulSoup(driver.page_source, 'html.parser')
print(driver.page_source)
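As a side note, explicit waits are usually more robust than fixed time.sleep() calls; a minimal sketch, where the CSS selector is just an illustrative stand-in for the long result selector above:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 15 seconds for the result element to appear instead of sleeping blindly
wait = WebDriverWait(driver, 15)
result = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "h3 > div")))  # illustrative selector
print(result.text)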

How to open scraped links one by one automatically using Python?

So here is my situation: let's say you search on eBay for "Motorola DynaTAC 8000x". The bot that I built scrapes all the links of the listings. My goal now is to make it open those scraped links one by one.
I think something like that would be possible using loops, but I am not sure how to do it. Thanks in advance!
Here is the code of the bot:
import requests
from bs4 import BeautifulSoup

url = "https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2380057.m570.l1313&_nkw=Motorola+DynaTAC+8000x&_sacat=0"
r = requests.get(url)
soup = BeautifulSoup(r.content, features="lxml")

listings = soup.select("li a")
for a in listings:
    link = a["href"]
    if link.startswith("https://www.ebay.com/itm/"):
        print(link)
To get information from the link you can do:
import requests
from bs4 import BeautifulSoup

url = "https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2380057.m570.l1313&_nkw=Motorola+DynaTAC+8000x&_sacat=0"
r = requests.get(url)
soup = BeautifulSoup(r.content, features="lxml")

listings = soup.select("li a")
for a in listings:
    link = a["href"]
    if link.startswith("https://www.ebay.com/itm/"):
        s = BeautifulSoup(requests.get(link).content, "lxml")
        price = s.select_one('[itemprop="price"]')
        print(s.h1.text)
        print(price.text if price else "-")
        print(link)
        print("-" * 80)
Prints:
...
Details about  MOTOROLA DYNATAC 8100L- BRICK CELL PHONE VINTAGE RETRO RARE MUSEUM 8000X
GBP 555.00
https://www.ebay.com/itm/393245721991?hash=item5b8f458587:g:c7wAAOSw4YdgdvBt
--------------------------------------------------------------------------------
Details about  MOTOROLA DYNATAC 8100L- BRICK CELL PHONE VINTAGE RETRO RARE MUSEUM 8000X
GBP 555.00
https://www.ebay.com/itm/393245721991?hash=item5b8f458587:g:c7wAAOSw4YdgdvBt
--------------------------------------------------------------------------------
Details about  Vintage Pulsar Extra Thick Brick Cell Phone Has Dynatac 8000X Display
US $3,000.00
https://www.ebay.com/itm/163814682288?hash=item26241daeb0:g:sTcAAOSw6QJdUQOX
--------------------------------------------------------------------------------
...
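One hedged refinement: fetching every listing back-to-back can get the scraper throttled, so it may be worth pausing inside the loop; a sketch reusing the listings list from the script above:

import time

# same loop as above, with a short pause so eBay isn't hit with rapid-fire requests
for a in listings:
    link = a["href"]
    if link.startswith("https://www.ebay.com/itm/"):
        s = BeautifulSoup(requests.get(link).content, "lxml")
        print(s.h1.text)
        time.sleep(1)  # one second between listing requests; adjust to taste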

Unable to scrape date/time info using BeautifulSoup

I am trying to web scrape an upcoming event date on reuters.com using Python and the BeautifulSoup package.
Unfortunately, it seems harder than expected to extract the upcoming earnings event date and time from the HTML.
I do not understand why I get no visible output from the script below, although I can see the value when inspecting the target URL in the browser. Does anybody know why? Is there any viable workaround?
import requests
from bs4 import BeautifulSoup

header = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:70.0) Gecko/20100101 Firefox/70.0'}
URL = 'https://www.reuters.com/companies/SAPG.DE/events'
page = requests.get(URL, headers=header)
soup = BeautifulSoup(page.content, 'html.parser')

results = soup.find(id='__next')
job_elems = results.find_all('section', class_='Events-section-2YwsJ')
for job_elem in job_elems:
    event_type = job_elem.find('h3').text
    if event_type.find('Events') != -1:
        print(job_elem.find('h3').text)
        items = job_elem.find_all('div', class_='EventList-event-Veu-f')
        for item in items:
            title = item.find('span').text
            earnings_time = item.find('time').get_text()
            if title.find('Earnings Release') != -1:
                print(earnings_time)
The class attribute of the element in question is EventList-date-cLNT9, which I have never seen before.
This happens because the time tag is loaded via JavaScript, but bs4 only sees the static HTML. You have two options:
one is to use Selenium, the other is to use their API.
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
URL = 'https://www.reuters.com/companies/SAPG.DE/events'
driver.get(URL)
soup = BeautifulSoup(driver.page_source, 'html.parser')

results = soup.find(id='__next')
job_elems = results.find_all('section', class_='Events-section-2YwsJ')
for job_elem in job_elems:
    event_type = job_elem.find('h3').text
    if event_type.find('Events') != -1:
        print(job_elem.find('h3').text)
        items = job_elem.find_all('div', class_='EventList-event-Veu-f')
        for item in items:
            title = item.find('span').text
            time = item.find('time').text
            print(f"Title: {title}, Time: {time}")

driver.quit()
Output:
Upcoming Events
Title: SAP SE at Morgan Stanley Technology, Media and Telecom Conference (Virtual), Time: 1 Mar 2021 / 6PM EET
Title: Q1 2021 SAP SE Earnings Release, Time: 22 Apr 2021 / 8AM EET
The reason for that is that those events are added dynamically by JavaScript, which means they are not present in the HTML you get back.
However, there's an API you can query to get the events.
Here's how:
import requests

api_url = "https://www.reuters.com/companies/api/getFetchCompanyEvents/SAPG.DE"
response = requests.get(api_url).json()

for event in response["market_data"]["upcoming_event"]:
    print(f"{event['name']} - {event['time']}")
Output:
SAP SE at Morgan Stanley Technology, Media and Telecom Conference (Virtual) - 2021-03-01T16:45:00Z
Q1 2021 SAP SE Earnings Release - 2021-04-22T06:30:00Z
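The API returns ISO-8601 UTC timestamps rather than the localized strings shown on the page; if you need local times, a minimal sketch for converting one (using a value from the output above):

from datetime import datetime

# the trailing 'Z' means UTC; fromisoformat() accepts it once rewritten as an offset
ts = datetime.fromisoformat("2021-04-22T06:30:00Z".replace("Z", "+00:00"))
print(ts.astimezone())  # displayed in the machine's local timezone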

How do I get the next tag

I am trying to get the headlines that are in between a class. The headlines are wrapped in an h2 tag, and each headline comes after the span tag.
from bs4 import BeautifulSoup
import requests

r = requests.get("https://www.dailypost.ng/hot-news")
soup = BeautifulSoup(r.content, "html.parser")
mydivs = soup.findAll("span", {"class": "mvp-cd-date left relative"})
mytags = mydivs.findNext('h2')
for tag in mytags:
    print(tag.text.strip())
You must iterate through mydivs to use findNext().
mydivs is a list of elements, and findNext() only applies to a single element. You must iterate through the spans and run findNext() on each of them.
Just add this line
for div in mydivs:
and put it before
mytags = div.findNext('h2')
Here is the full code for your working program:
from bs4 import BeautifulSoup
import requests

r = requests.get("https://www.dailypost.ng/hot-news")
soup = BeautifulSoup(r.content, "html.parser")
mydivs = soup.findAll("span", {"class": "mvp-cd-date left relative"})
for div in mydivs:
    mytags = div.findNext('h2')
    for tag in mytags:
        print(tag.strip())
Try replacing the last 3 lines with:

for div in mydivs:
    mytags = div.findNext('h2')
    for tag in mytags:
        print(tag.strip())
soup.findAll() returns a list of tags (a ResultSet), so you cannot call findNext() on it. However, you can iterate over the tags and call find_next() on each tag separately:
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.dailypost.ng/hot-news")
soup = BeautifulSoup(r.content, "html.parser")

mydivs = soup.findAll("span", {"class": "mvp-cd-date left relative"})
for tag in mydivs:
    print(tag.find_next('h2').get_text(strip=True))
Prints:
BREAKING: Another federal lawmaker dies in Dubai hospital
Cross-Over Night: Enugu Govt bans burning of tyres on roads
Dadiyata: DSS breaks silence as Nigerian govt critic remains missing
CAC: Nigerian govt appoints new Acting Registrar-General
What Buhari told me – Dabiri-Erewa
What soldiers should expect in 2020 – Buratai
Only earthquake can erase Amosun’s legacies in Ogun – Akinlade
Civil War: Militia leader sentenced to 20yrs in prison
2020: Prophet Omale releases prophecies on Buhari, Aisha, Kyari, govs, coup plot
BREAKING: EFCC arrests Shehu Sani
Armed Forces Day: Yobe Governor Buni, donates N40 million for emblem appeal fund
Zamfara govt bans illegal gathering in the state
Agbenu Kacholalo: Colours of culture at Idoma International Carnival 2019 [PHOTOS]
Men of God are too fearful, weak to challenge government activities
2020: Peter Obi sends message to Nigerians
TETFUND: EFCC, ICPC asked to probe agency over alleged corruption
Two inmates regain freedom from Uyo prison
Buhari meets President of AfDB, Adeshina at Aso Rock
New Kogi CP resumes office, promises crime free state
Nothing stops you from paying N30,000 minimum wage to workers – APC challenges Makinde
EDIT: This script will scrape headlines from several pages:
import requests
from bs4 import BeautifulSoup

url = 'https://dailypost.ng/hot-news/page/{}/'

for page in range(1, 5):  # <-- change how many pages you want
    print('Page no.{}'.format(page))
    soup = BeautifulSoup(requests.get(url.format(page)).content, "html.parser")
    mydivs = soup.findAll("span", {"class": "mvp-cd-date left relative"})
    for tag in mydivs:
        print(tag.find_next('h2').get_text(strip=True))
    print('-' * 80)
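If you don't know how many pages there are, one hedged variant is to keep going until a page yields no date spans, reusing url and the imports from the script above (the assumption being that an empty page means you are past the last one):

page = 1
while True:
    soup = BeautifulSoup(requests.get(url.format(page)).content, "html.parser")
    mydivs = soup.findAll("span", {"class": "mvp-cd-date left relative"})
    if not mydivs:
        break  # no headlines on this page; assume we've run out of pages
    for tag in mydivs:
        print(tag.find_next('h2').get_text(strip=True))
    page += 1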

Python Webscraping with BeautifulSoup not displaying full content

I am trying to scrape all the text from a webpage which is embedded within the td tags that have a class="calendar__cell calendar__currency currency ". As of now my code only returns the first occurrence of this tag and class. How can I keep it iterating through the source code so that it returns all occurrences one by one? The webpage is forexfactory.com.
from bs4 import BeautifulSoup
import requests
source = requests.get("https://www.forexfactory.com/#detail=108867").text
soup = BeautifulSoup(source, 'lxml')
body = soup.find("body")
article = body.find("table", class_="calendar__table")
actual = article.find("td", class_="calendar__cell calendar__actual actual")
forecast = article.find("td", class_="calendar__cell calendar__forecast forecast").text
currency = article.find("td", class_="calendar__cell calendar__currency currency")
Tcurrency = currency.text
Tactual = actual.text
print(Tcurrency)
You have to use find_all() to get all the elements, and then you can use a for-loop to iterate over them.
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.forexfactory.com/#detail=108867")
soup = BeautifulSoup(r.text, 'lxml')

table = soup.find("table", class_="calendar__table")
for row in table.find_all('tr', class_='calendar__row--grey'):
    currency = row.find("td", class_="currency")
    #print(currency.prettify())  # before getting text
    currency = currency.get_text(strip=True)

    actual = row.find("td", class_="actual")
    actual = actual.get_text(strip=True)

    forecast = row.find("td", class_="forecast")
    forecast = forecast.get_text(strip=True)

    print(currency, actual, forecast)
Result
CHF 96.4 94.6
EUR 0.8% 0.9%
GBP 43.7K 41.3K
EUR 1.35|1.3
USD -63.2B -69.2B
USD 0.0% 0.2%
USD 48.9 48.2
USD 1.2% 1.5%
BTW: I found that this page uses JavaScript to redirect the page, and in the browser I see a table with different values. But if I turn off JavaScript in the browser, then it shows me the data which I get with the Python code. BeautifulSoup and requests can't run JavaScript. If you need the data as shown in the browser, then you may need Selenium to control a web browser which can run JavaScript.
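A minimal hedged sketch of that Selenium route, handing the rendered page source to BeautifulSoup the way the earlier answers do (assuming the same calendar__table markup as in the requests version):

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://www.forexfactory.com/#detail=108867")

# parse the JavaScript-rendered source with the same selectors as above
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()

table = soup.find("table", class_="calendar__table")
for row in table.find_all('tr', class_='calendar__row--grey'):
    print(row.get_text(" ", strip=True))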
