nested web scraping with beautifulSoup - python

I am looking to download the "Latest File" from provided url below
https://www.abs.gov.au/statistics/economy/national-accounts/australian-national-accounts-national-income-expenditure-and-product
The file i want to download is at the following exact location
https://www.abs.gov.au/statistics/economy/national-accounts/australian-national-accounts-national-income-expenditure-and-product/sep-2022#data-downloads
for example file name is "Table 1"
how can i download this when i am only given the base URL as above? using beautifulSoup
I am unable to figure out how to work through nested urls within the html page to find the one i need to download.

First u need to get latest link:
latest_link = 'https://www.abs.gov.au/' + soup.find('span', class_='flag_latest').find_previous('a').get('href')
Then find document to download, in my example - download all, but u can change it:
download_all_link = 'https://www.abs.gov.au/' + soup.find('div', class_='anchor-button-wrapper').find('a').get('href')
And last point - download it.
FULL CODE:
import requests
from bs4 import BeautifulSoup
url = 'https://www.abs.gov.au/statistics/economy/national-accounts/australian-national-accounts-national-income-expenditure-and-product'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
latest_link = 'https://www.abs.gov.au/' + soup.find('span', class_='flag_latest').find_previous('a').get('href')
response = requests.get(latest_link)
soup = BeautifulSoup(response.text, 'lxml')
download_all_link = 'https://www.abs.gov.au/' + soup.find('div', class_='anchor-button-wrapper').find('a').get('href')
file_data = requests.get(download_all_link).content
with open(download_all_link.split("/")[-1], 'wb') as handler:
handler.write(file_data)

I've never used BeautifulSoup before. Pretty cool stuff. This seems to do it or me:
from bs4 import BeautifulSoup
with open("demo.html") as fp:
soup = BeautifulSoup(fp, "html.parser")
# lets look for the span with the 'flag_latest' class attribute
for span in soup.find_all('span'):
if span.get('class', None) and 'flag_latest' in span['class']:
# step up the a level to the div and grab the a tag
print(span.parent.a['href'])
So we just look for the span with the 'flag_latest' class and then step up a level in the tree (a div) and then grab the first a tag and extract the href.
Check out the docs and read the sections on "Navigating the Tree" and "Searching the Tree"

Related

How to select all links of apps from app store and extract its href?

from bs4 import BeautifulSoup
import requests
from urllib.request import urlopen
url = f'https://www.apple.com/kr/search/youtube?src=globalnav'
response = requests.get(url)
html = response.text
soup = BeautifulSoup(html, 'html.parser')
links = soup.select(".rf-serp-productname-list")
print(links)
I want to crawl through all links of shown apps. When I searched for a keyword, I thought links = soup.select(".rf-serp-productname-list") would work, but links list is empty.
What should I do?
Just check this code, I think is what you want:
import re
import requests
from bs4 import BeautifulSoup
pages = set()
def get_links(page_url):
global pages
pattern = re.compile("^(/)")
html = requests.get(f"your_URL{page_url}").text # fstrings require Python 3.6+
soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all("a", href=pattern):
if "href" in link.attrs:
if link.attrs["href"] not in pages:
new_page = link.attrs["href"]
print(new_page)
pages.add(new_page)
get_links(new_page)
get_links("")
Source:
https://gist.github.com/AO8/f721b6736c8a4805e99e377e72d3edbf
You can change the part:
for link in soup.find_all("a", href=pattern):
#do something
To check for a keyword I think
You are cooking a soup so first at all taste it and check if everything you expect contains in it.
ResultSet of your selection is empty cause structure in response differs a bit from your expected one from the developer tools.
To get the list of links select more specific:
links = [a.get('href') for a in soup.select('a.icon')]
Output:
['https://apps.apple.com/kr/app/youtube/id544007664', 'https://apps.apple.com/kr/app/%EC%BF%A0%ED%8C%A1%ED%94%8C%EB%A0%88%EC%9D%B4/id1536885649', 'https://apps.apple.com/kr/app/youtube-music/id1017492454', 'https://apps.apple.com/kr/app/instagram/id389801252', 'https://apps.apple.com/kr/app/youtube-kids/id936971630', 'https://apps.apple.com/kr/app/youtube-studio/id888530356', 'https://apps.apple.com/kr/app/google-chrome/id535886823', 'https://apps.apple.com/kr/app/tiktok-%ED%8B%B1%ED%86%A1/id1235601864', 'https://apps.apple.com/kr/app/google/id284815942']

How to get specific text hyperlinks in the home webpage by BeautifulSoup?

I want to search all hyperlink that its text name includes "article" in https://www.geeksforgeeks.org/
for example, on the bottom of this webpage
Write an Article
Improve an Article
I want to get them all hyperlink and print them, so I tried to,
from urllib.request import urlopen
from bs4 import BeautifulSoup
import os
import re
url = 'https://www.geeksforgeeks.org/'
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, "html.parser")
links = []
for link in soup.findAll('a',href = True):
#print(link.get("href")
if re.search('/article$', href):
links.append(link.get("href"))
However, it get a [] in result, how to solve it?
Here is something you can try:
Note that there are more links with the test article in the link you provided, but it gives the idea how you can deal with this.
In this case I just checked if the word article is in the text of that tag. You can use regex search there, but for this example it is an overkill.
import requests
from bs4 import BeautifulSoup
url = 'https://www.geeksforgeeks.org/'
res = requests.get(url)
if res.status_code != 200:
'no resquest'
soup = BeautifulSoup(res.content, "html.parser")
links_with_article = soup.findAll(lambda tag:tag.name=="a" and "article" in tag.text.lower())
EDIT:
If you know that there is a word in the href, i.e. in the link itself:
soup.select("a[href*=article]")
this will search for the word article in the href of all elements a.
Edit: get only href:
hrefs = [link.get('href') for link in links_with_article]

Exporting data from HTML to Excel

i just started programming.
I have the task to extract data from a HTML page to Excel.
Using Python 3.7.
My Problem is, that i have a website, whith more urls inside.
Behind these urls again more urls.
I need the data behind the third url.
My first Problem would be, how i can dictate the programm to choose only specific links from an ul rather then every ul on the page?
from bs4 import BeautifulSoup
import urllib
import requests
import re
page = urllib.request.urlopen("file").read()
soup = BeautifulSoup(page, "html.parser")
print(soup.prettify())
for link in soup.find_all("a", href=re.compile("katalog_")):
links= link.get("href")
if "katalog" in links:
for link in soup.find_all("a", href=re.compile("alle_")):
links = link.get("href")
print(soup.get_text())
There are many ways, one is to use "find_all" and try to be specific on the tags like "a" just like you did. If that's the only option, then use regular expression with your output. You can refer to this thread: Python BeautifulSoup Extract specific URLs. Also please show us either the link, or html structure of the links you want to extract. We would like to see the differences between the URLs.
PS: Sorry I can't make comments because of <50 reputation or I would have.
Updated answer based on understanding:
from bs4 import BeautifulSoup
import urllib
import requests
page = urllib.request.urlopen("https://www.bsi.bund.de/DE/Themen/ITGrundschutz/ITGrundschutzKompendium/itgrundschutzKompendium_node.html").read()
soup = BeautifulSoup(page, "html.parser")
for firstlink in soup.find_all("a",{"class":"RichTextIntLink NavNode"}):
firstlinks = firstlink.get("href")
if "bausteine" in firstlinks:
bausteinelinks = "https://www.bsi.bund.de/" + str(firstlinks.split(';')[0])
response = urllib.request.urlopen(bausteinelinks).read()
soup = BeautifulSoup(response, 'html.parser')
secondlink = "https://www.bsi.bund.de/" + str(((soup.find("a",{"class":"RichTextIntLink Basepage"})["href"]).split(';'))[0])
res = urllib.request.urlopen(secondlink).read()
soup = BeautifulSoup(res, 'html.parser')
listoftext = soup.find_all("div",{"id":"content"})
for text in listoftext:
print (text.text)

Beautifulsoup extracting urls from a given website menu

Hello every one I'm new to beautifulsoup, I'm trying to write a function that will be able to extract second level urls from a given website.
For example if I have this website url : https://edition.cnn.com/ my function should be able to return
https://edition.cnn.com/world
https://edition.cnn.com/politics
https://edition.cnn.com/business
https://edition.cnn.com/health
https://edition.cnn.com/entertainment
https://edition.cnn.com/style
https://edition.cnn.com/travel
first I have tried this code to retrieve all links starting with the string of the url:
from bs4 import BeautifulSoup as bs4
import requests
import lxml
import re
def getLinks(url):
response = requests.get(url)
data = response.text
soup = bs4(data, 'lxml')
links = []
for link in soup.find_all('a', href=re.compile(str(url))):
links.append(link.get('href'))
return links
But then again the actual output is giving me all the links even links of articles which is not I'm looking for. is there a method that I can use to get what I want using regular expression or others.
The links are inside <nav> tag, so using CSS selector nav a[href] will select only links inside <nav> tag:
import requests
from bs4 import BeautifulSoup
url = 'https://edition.cnn.com'
soup = BeautifulSoup(requests.get(url).text, 'lxml')
for a in soup.select('nav a[href]'):
if a['href'].count('/') > 1 or '#' in a['href']:
continue
print(url + a['href'])
Prints:
https://edition.cnn.com/world
https://edition.cnn.com/politics
https://edition.cnn.com/business
https://edition.cnn.com/health
https://edition.cnn.com/entertainment
https://edition.cnn.com/style
https://edition.cnn.com/travel
https://edition.cnn.com/sport
https://edition.cnn.com/videos
https://edition.cnn.com/world
https://edition.cnn.com/africa
https://edition.cnn.com/americas
https://edition.cnn.com/asia
https://edition.cnn.com/australia
https://edition.cnn.com/china
https://edition.cnn.com/europe
https://edition.cnn.com/india
https://edition.cnn.com/middle-east
https://edition.cnn.com/uk
...and so on.

BeautifulSoup class find returning None

I am writing a python program, using BeautifulSoup, that will retrieve a download link on a website. I am using the find method to retrieve the html class that the link is located in, but it is returning None.
I have tried using accessing this class using parent classes, but was unsuccessful.
Here is my code
link = 'https://data.worldbank.org/topic/agriculture-and-rural-development?view=chart'
for link in indicator_links:
indicator_page = requests.get(link)
indicator_soup = BeautifulSoup(page.text, 'html.parser')
download = indicator_soup.find(class_="btn-item download")
Again, I want the download link located inside the btn-item download html class.
Do you mean all links inside the btn-item download html class?
Change your code with this one:
link = 'https://data.worldbank.org/topic/agriculture-and-rural-development?view=chart'
page = requests.get(link)
indicator_soup = BeautifulSoup(page.text, 'html.parser')
download = indicator_soup.find(class_="btn-item download")
for lnk in download.find_all('a', href=True):
print(lnk['href'])
The problem was that I was creating the BeautifulSoup object with the wrong html argument.
It should have been:
indicator_soup = BeautifulSoup(indicator_page.text, 'html.parser')
instead of
indicator_soup = BeautifulSoup(page.text, 'html.parser')
If you want a link it will be 100% in a < a > tag.
This is the best I can do to give a helping hand:
from bs4 import BeautifulSoup
import urllib.request
page_url = "https://data.worldbank.org/topic/agriculture-and-rural-development?view=chart"
soup = BeautifulSoup(urllib.request.urlopen(page_url), 'lxml')
what_you_want = soup.find('a', clas_="btn-item download")
This should give you the link you want.
Not sure what you are trying to do in your code since I can't tell what indicator_links is.

Categories