I am writing a Python program, using BeautifulSoup, that retrieves a download link from a website. I am using the find method to retrieve the element with the HTML class that contains the link, but it returns None.
I have tried reaching this element through its parent classes, but was unsuccessful.
Here is my code:
link = 'https://data.worldbank.org/topic/agriculture-and-rural-development?view=chart'
for link in indicator_links:
    indicator_page = requests.get(link)
    indicator_soup = BeautifulSoup(page.text, 'html.parser')
    download = indicator_soup.find(class_="btn-item download")
Again, I want the download link located inside the btn-item download html class.
Do you mean all links inside the btn-item download html class?
Replace your code with this:
import requests
from bs4 import BeautifulSoup
link = 'https://data.worldbank.org/topic/agriculture-and-rural-development?view=chart'
page = requests.get(link)
indicator_soup = BeautifulSoup(page.text, 'html.parser')
download = indicator_soup.find(class_="btn-item download")
# Print the href of every <a> tag inside the download container
for lnk in download.find_all('a', href=True):
    print(lnk['href'])
The problem was that I was creating the BeautifulSoup object with the wrong html argument.
It should have been:
indicator_soup = BeautifulSoup(indicator_page.text, 'html.parser')
instead of
indicator_soup = BeautifulSoup(page.text, 'html.parser')
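A defensive version of the corrected lookup might look like the sketch below. The helper name extract_download_hrefs and the SAMPLE_HTML snippet are made up for illustration (the real World Bank markup may differ). The point is only that find returns None when nothing matches, so it is worth checking before calling find_all on the result.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
SAMPLE_HTML = (
    '<div class="btn-item download">'
    '<a href="/data.csv">CSV</a><a href="/data.xls">Excel</a>'
    '</div>'
)

def extract_download_hrefs(html):
    """Return the hrefs inside the 'btn-item download' element, or [] if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find(class_="btn-item download")
    if container is None:
        # find() found nothing: wrong HTML was parsed, or the class is not in it
        return []
    return [a["href"] for a in container.find_all("a", href=True)]

hrefs = extract_download_hrefs(SAMPLE_HTML)
missing = extract_download_hrefs("<p>no download button here</p>")
print(hrefs, missing)
```

In the loop from the question, the same helper would be called with indicator_page.text for each link.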
If you want a link, it will always be in an <a> tag.
This is the best I can do to give a helping hand:
from bs4 import BeautifulSoup
import urllib.request
page_url = "https://data.worldbank.org/topic/agriculture-and-rural-development?view=chart"
soup = BeautifulSoup(urllib.request.urlopen(page_url), 'lxml')
what_you_want = soup.find('a', class_="btn-item download")
This should give you the link you want (the URL itself is in its href attribute).
Not sure what you are trying to do in your code since I can't tell what indicator_links is.
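One detail worth knowing about class_="btn-item download": with a string containing a space, BeautifulSoup compares against the exact value of the class attribute, so the order of the class names matters. A CSS selector does not have that restriction. The sketch below uses an invented HTML snippet, with the classes deliberately in the other order, to show the difference.

```python
from bs4 import BeautifulSoup

# Invented snippet: same two classes, but written in the opposite order.
html = '<div class="download btn-item"><a href="/file.csv">Download</a></div>'
soup = BeautifulSoup(html, "html.parser")

# Exact-string match on the class attribute: fails because the order differs.
exact = soup.find(class_="btn-item download")

# CSS selector: matches any element carrying both classes, in any order.
by_css = soup.select_one(".btn-item.download a")

print(exact)
print(by_css["href"])
```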
Related
I am looking to download the "Latest File" from the URL below:
https://www.abs.gov.au/statistics/economy/national-accounts/australian-national-accounts-national-income-expenditure-and-product
The file I want to download is at the following exact location:
https://www.abs.gov.au/statistics/economy/national-accounts/australian-national-accounts-national-income-expenditure-and-product/sep-2022#data-downloads
For example, the file name is "Table 1".
How can I download this with BeautifulSoup when I am only given the base URL above?
I am unable to figure out how to work through the nested URLs within the HTML page to find the one I need to download.
First you need to get the link to the latest release:
latest_link = 'https://www.abs.gov.au/' + soup.find('span', class_='flag_latest').find_previous('a').get('href')
Then find the document to download. In my example I download everything, but you can change that:
download_all_link = 'https://www.abs.gov.au/' + soup.find('div', class_='anchor-button-wrapper').find('a').get('href')
Finally, download it.
Full code:
import requests
from bs4 import BeautifulSoup
url = 'https://www.abs.gov.au/statistics/economy/national-accounts/australian-national-accounts-national-income-expenditure-and-product'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
latest_link = 'https://www.abs.gov.au/' + soup.find('span', class_='flag_latest').find_previous('a').get('href')
response = requests.get(latest_link)
soup = BeautifulSoup(response.text, 'lxml')
download_all_link = 'https://www.abs.gov.au/' + soup.find('div', class_='anchor-button-wrapper').find('a').get('href')
file_data = requests.get(download_all_link).content
with open(download_all_link.split("/")[-1], 'wb') as handler:
    handler.write(file_data)
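The code above builds absolute URLs by string concatenation, which assumes every href is site-relative. As a hedged alternative, urllib.parse.urljoin from the standard library handles both relative and already-absolute hrefs. The href values below are invented for illustration and are not taken from the ABS site.

```python
import posixpath
from urllib.parse import urljoin, urlparse

BASE = "https://www.abs.gov.au/statistics/economy/national-accounts/"

# A site-relative href (hypothetical) is resolved against the site root.
site_relative = urljoin(BASE, "/statistics/economy/latest-release")

# An href that is already absolute (hypothetical) is returned unchanged.
absolute = urljoin(BASE, "https://example.org/files/table1.xlsx")

# Derive a local file name from the URL path, ignoring any query string.
filename = posixpath.basename(urlparse(absolute).path)

print(site_relative)
print(absolute)
print(filename)
```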
I've never used BeautifulSoup before. Pretty cool stuff. This seems to do it for me:
from bs4 import BeautifulSoup
with open("demo.html") as fp:
    soup = BeautifulSoup(fp, "html.parser")
# Look for the span with the 'flag_latest' class attribute
for span in soup.find_all('span'):
    if span.get('class', None) and 'flag_latest' in span['class']:
        # Step up a level to the div and grab the a tag
        print(span.parent.a['href'])
So we just look for the span with the 'flag_latest' class and then step up a level in the tree (a div) and then grab the first a tag and extract the href.
Check out the docs and read the sections on "Navigating the Tree" and "Searching the Tree".
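The two navigation styles used in these answers (stepping up to the parent, and find_previous) can be compared on a small example. The markup below is an assumption about how such a "latest" flag might sit next to its link; the real page structure may differ, in which case the two approaches can give different results.

```python
from bs4 import BeautifulSoup

# Assumed structure: each release is a div, and only the latest one has the flag span.
html = """
<div class="release"><a href="/stats/sep-2022">September 2022</a>
  <span class="flag_latest">Latest release</span></div>
<div class="release"><a href="/stats/jun-2022">June 2022</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

span = soup.find("span", class_="flag_latest")

# Step up to the enclosing div, then take its first <a>.
via_parent = span.parent.a["href"]

# Walk backwards through the document to the nearest preceding <a>.
via_previous = span.find_previous("a")["href"]

print(via_parent, via_previous)
```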
I am trying to parse this page "https://www.imdb.com/title/tt0068112/?ref_=fn_al_tt_1", but I can't find the href that I need (href="/title/tt0068112/episodes?ref_=tt_eps_sm").
I tried with this code:
url = "https://www.imdb.com/title/tt0068112/?ref_=fn_al_tt_1"
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")
for a in soup.find_all('a'):
    print(a['href'])
What's wrong with this? I also tried to check "manually" with print(soup.prettify()) but it seems that that link is hidden or something like that.
You can get the page HTML with requests; the href is in there, so there is no need for special APIs. I tried this and it worked:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.imdb.com/title/tt0068112/?ref_=fn_al_tt_1")
soup = BeautifulSoup(page.content, "html.parser")
scooby_link = ""
for item in soup.findAll("a", href="/title/tt0068112/episodes?ref_=tt_eps_sm"):
    print(item["href"])
    scooby_link = "https://www.imdb.com" + "/title/tt0068112/episodes?ref_=tt_eps_sm"
print(scooby_link)
I'm assuming you also wanted to save the link to a variable for further scraping so I did that as well. 🙂
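One pitfall in the loop from the question: a['href'] raises a KeyError for any <a> tag that has no href attribute, which can stop the loop before it reaches the link you are after. A small sketch with invented markup shows two ways around that: filtering with href=True, or using .get, which returns None instead of raising.

```python
from bs4 import BeautifulSoup

# Invented markup: the first anchor has no href attribute.
html = '<a name="top">Top</a><a href="/title/tt0068112/episodes">Episodes</a>'
soup = BeautifulSoup(html, "html.parser")

# Only anchors that actually carry an href attribute.
with_href = [a["href"] for a in soup.find_all("a", href=True)]

# All anchors; .get returns None where the attribute is missing.
all_hrefs = [a.get("href") for a in soup.find_all("a")]

print(with_href)
print(all_hrefs)
```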
To get the Episodes link, you can use the following example:
import requests
from bs4 import BeautifulSoup
url = "https://www.imdb.com/title/tt0068112/?ref_=fn_al_tt_1"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
print(soup.select_one("a:-soup-contains(Episodes)")["href"])
Prints:
/title/tt0068112/episodes?ref_=tt_eps_sm
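If matching on the link text feels fragile, a CSS attribute selector on the href is another option. This is a sketch on a made-up fragment, not the real IMDb page. It also guards against select_one returning None and turns the relative href into an absolute URL with urljoin.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Made-up fragment standing in for the real page.
html = (
    '<a href="/title/tt0068112/">Overview</a>'
    '<a href="/title/tt0068112/episodes?ref_=tt_eps_sm">Episodes</a>'
)
soup = BeautifulSoup(html, "html.parser")

# Match any <a> whose href contains "/episodes".
link = soup.select_one('a[href*="/episodes"]')

# select_one returns None when nothing matches, so check before indexing.
episodes_url = urljoin("https://www.imdb.com", link["href"]) if link else None

print(episodes_url)
```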
I just started programming.
My task is to extract data from an HTML page into Excel.
I am using Python 3.7.
My problem is that I have a website with more URLs inside it.
Behind these URLs there are again more URLs.
I need the data behind the third URL.
My first question is: how can I tell the program to choose only specific links from one ul, rather than from every ul on the page?
from bs4 import BeautifulSoup
import urllib.request
import requests
import re
page = urllib.request.urlopen("file").read()
soup = BeautifulSoup(page, "html.parser")
print(soup.prettify())
for link in soup.find_all("a", href=re.compile("katalog_")):
    links = link.get("href")
    if "katalog" in links:
        for link in soup.find_all("a", href=re.compile("alle_")):
            links = link.get("href")
print(soup.get_text())
There are many ways. One is to use find_all and be specific about the tags, such as "a", just as you did. If that is the only option, then filter the output with a regular expression. You can refer to this thread: Python BeautifulSoup Extract specific URLs. Also, please show us either the link or the HTML structure of the links you want to extract, so we can see the differences between the URLs.
PS: Sorry I can't make comments because of <50 reputation or I would have.
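To restrict the search to a single ul, one approach is to locate that ul first and then call find_all on it rather than on the whole soup. The sketch below assumes the target list can be identified by an id. The id "catalog" and the file names are invented, since the real HTML was not shown.

```python
import re
from bs4 import BeautifulSoup

# Invented page with two lists; only the second one is of interest.
html = """
<ul id="nav"><li><a href="katalog_nav.html">Nav entry</a></li></ul>
<ul id="catalog">
  <li><a href="katalog_a.html">A</a></li>
  <li><a href="other.html">Other</a></li>
  <li><a href="katalog_b.html">B</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Narrow the search to one specific ul first ...
catalog = soup.find("ul", id="catalog")

# ... then filter its links by a regular expression on the href.
links = [a["href"] for a in catalog.find_all("a", href=re.compile(r"^katalog_"))]

print(links)
```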
Updated answer based on understanding:
from bs4 import BeautifulSoup
import urllib.request
import requests
page = urllib.request.urlopen("https://www.bsi.bund.de/DE/Themen/ITGrundschutz/ITGrundschutzKompendium/itgrundschutzKompendium_node.html").read()
soup = BeautifulSoup(page, "html.parser")
for firstlink in soup.find_all("a",{"class":"RichTextIntLink NavNode"}):
    firstlinks = firstlink.get("href")
    if "bausteine" in firstlinks:
        bausteinelinks = "https://www.bsi.bund.de/" + str(firstlinks.split(';')[0])
        response = urllib.request.urlopen(bausteinelinks).read()
        soup = BeautifulSoup(response, 'html.parser')
        secondlink = "https://www.bsi.bund.de/" + str(((soup.find("a",{"class":"RichTextIntLink Basepage"})["href"]).split(';'))[0])
        res = urllib.request.urlopen(secondlink).read()
        soup = BeautifulSoup(res, 'html.parser')
        listoftext = soup.find_all("div",{"id":"content"})
        for text in listoftext:
            print (text.text)
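The same "follow one link per level" idea can be written as a small reusable function. This is only a sketch: follow_links, the fetch parameter, the example.test URLs, and the PAGES dictionary are all hypothetical and exist so the example runs without network access. In real use, fetch could be something like lambda u: urllib.request.urlopen(u).read().

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def follow_links(fetch, start_url, selectors):
    """Follow one link per CSS selector; return the final URL and its parsed page."""
    url = start_url
    soup = BeautifulSoup(fetch(url), "html.parser")
    for selector in selectors:
        link = soup.select_one(selector)
        if link is None:
            raise LookupError(f"no match for {selector!r} on {url}")
        # Drop any ';jsessionid=...' suffix, then resolve against the current page.
        url = urljoin(url, link["href"].split(";")[0])
        soup = BeautifulSoup(fetch(url), "html.parser")
    return url, soup

# Hypothetical three-level site, held in memory instead of fetched over HTTP.
PAGES = {
    "https://example.test/index.html":
        '<a class="NavNode" href="/bausteine.html;jsessionid=abc">Bausteine</a>',
    "https://example.test/bausteine.html":
        '<a class="Basepage" href="/detail.html">Detail</a>',
    "https://example.test/detail.html":
        '<div id="content">Final text</div>',
}

final_url, final_soup = follow_links(
    PAGES.__getitem__, "https://example.test/index.html", ["a.NavNode", "a.Basepage"]
)
content = final_soup.find("div", id="content").get_text()
print(final_url, content)
```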
I'm using BeautifulSoup to parse the HTML of this site and extract the URLs of the results. But when I use find_all I get an empty list as output. I manually checked the HTML that I download from the site, and it contains the appropriate class.
If somebody could point out where I am making a mistake or show a better solution, I would be grateful!
from bs4 import BeautifulSoup
import requests
page = requests.get("https://www.awf.edu.pl/pracownik/wyszukiwarka-pracownikow?result_5251_result_page=3&queries_search_query=&category_kategorie=wydzia_wychowania_fizycznego&search_page_5251_submit_button=Szukaj&current_result_page=1&results_per_page=20&submitted_search_category=&mode=results")
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find_all('div', class_ = 'search-item photo')
I've also tried the code below to find all links on the site and then pick out the ones I need, but in this case I get only the parent tag. If an 'a' tag is nested inside another 'a' tag, it is skipped, whereas from the documentation I expected it to be included in the output as well.
from bs4 import BeautifulSoup
import requests
page = requests.get("https://www.awf.edu.pl/pracownik/wyszukiwarka-pracownikow?result_5251_result_page=3&queries_search_query=&category_kategorie=wydzia_wychowania_fizycznego&search_page_5251_submit_button=Szukaj&current_result_page=1&results_per_page=20&submitted_search_category=&mode=results")
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find_all('a')
I found this answer to a similar question (BeautifulSoup can't find class that exists on webpage?), but in my case I can see the HTML that I want to find in my console when I use print(soup.prettify()).
The problem you are facing is linked to the parser used for page.content.
Replace:
soup = BeautifulSoup(page.content, 'html.parser')
with:
soup = BeautifulSoup(page.content, 'lxml')
Hope this helps.
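Different parsers can build different trees from the same markup, especially when the HTML is malformed. A quick way to check whether the parser is the culprit is to run the same search under each installed parser and compare. The sketch below does that on a tiny invented fragment; lxml and html5lib are optional packages, so the code skips any parser that is not installed rather than assuming it is there.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Invented, well-formed fragment; real pages may be messier and give differing counts.
SAMPLE_HTML = '<div class="search-item photo"><a href="/staff/1">Person</a></div>'

def count_results(html, parser):
    """Count divs with the exact class string under the given parser."""
    soup = BeautifulSoup(html, parser)
    return len(soup.find_all("div", class_="search-item photo"))

counts = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        counts[parser] = count_results(SAMPLE_HTML, parser)
    except FeatureNotFound:
        counts[parser] = None  # this parser is not installed here

print(counts)
```

If the counts differ on the real page, the markup is probably being repaired differently by each parser, and switching parsers as suggested above is a reasonable fix.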
I am extracting data for a research project and I have successfully used findAll('div', attrs={'class':'someClassName'}) on many websites, but this particular website (WebSite Link) doesn't return any values when I use the attrs option. When I don't use the attrs option, I get the entire HTML DOM.
Here is the simple code that I started with to test it out:
soup = bs(urlopen(url))
for div in soup.findAll('div', attrs={'class':'data'}):
    print div
My code works fine with requests:
import requests
from BeautifulSoup import BeautifulSoup as bs
#grab HTML
r = requests.get(r'http://www.amazon.com/s/ref=sr_pg_1?rh=n:172282,k%3adigital%20camera&keywords=digital%20camera&ie=UTF8&qid=1343600585')
html = r.text
#parse the HTML
soup = bs(html)
results= soup.findAll('div', attrs={'class': 'data'})
print results
If you or anyone reading this question would like to know why the code you've given (copied below) wasn't able to find the attrs value:
soup = bs(urlopen(url))
for div in soup.findAll('div', attrs={'class':'data'}):
    print div
The issue appears to be in how the BeautifulSoup object was created: in soup = bs(urlopen(url)), the value of urlopen(url) is a response object and not the HTML itself.
Any issues you encountered could probably have been resolved more easily by using bs(urlopen(url).read()) instead.
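For readers on the current bs4 package (the code above uses the older BeautifulSoup 3 import): as far as I know, the bs4 constructor accepts a string, bytes, or an open file-like object, so passing a handle or the result of .read() should parse the same way. Reading explicitly still helps, because it lets you inspect what the server actually returned. The sketch below uses io.BytesIO as a stand-in for a response object, and the markup is invented.

```python
import io
from bs4 import BeautifulSoup

# Invented markup; io.BytesIO plays the role of a file-like HTTP response.
raw = b'<div class="data">camera</div><div class="other">lens</div>'

from_bytes = BeautifulSoup(raw, "html.parser")             # already-read content
from_file = BeautifulSoup(io.BytesIO(raw), "html.parser")  # file-like object

def data_divs(soup):
    """Return the text of every div whose class is 'data'."""
    return [div.get_text() for div in soup.find_all("div", attrs={"class": "data"})]

print(data_divs(from_bytes), data_divs(from_file))
```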