Search in Sub-Pages of a Main Webpage using BeautifulSoup - python

I am trying to search for a div with class = 'class', but I need to find all matches in the mainpage as well as in the sub (or children) pages. How can I do this using BeautifulSoup or anything else?
I have found the closest answer in this search
Search the frequency of words in the sub pages of a webpage using Python
but this method only retrieved partial result, the page of interest has many more subpages. Is there another way of doing this?
My code so far:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
subpages = []
for anchor in soup.find_all('a', href=True):
string = 'https://www.mainpage.nl/'+str(anchor['href'])
subpages.append(string)
for subpage in subpages:
try:
soup_sub = BeautifulSoup(requests.get(subpage).content, 'html.parser')
promotie = soup_sub.find_all('strong', class_='c-action-banner__subtitle')
if len(promotie) > 0:
print(promotie)
except Exception:
pass
Thanks!

Related

How to select all links of apps from app store and extract its href?

from bs4 import BeautifulSoup
import requests
from urllib.request import urlopen
url = f'https://www.apple.com/kr/search/youtube?src=globalnav'
response = requests.get(url)
html = response.text
soup = BeautifulSoup(html, 'html.parser')
links = soup.select(".rf-serp-productname-list")
print(links)
I want to crawl through all links of shown apps. When I searched for a keyword, I thought links = soup.select(".rf-serp-productname-list") would work, but links list is empty.
What should I do?
Just check this code, I think is what you want:
import re
import requests
from bs4 import BeautifulSoup
pages = set()
def get_links(page_url):
global pages
pattern = re.compile("^(/)")
html = requests.get(f"your_URL{page_url}").text # fstrings require Python 3.6+
soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all("a", href=pattern):
if "href" in link.attrs:
if link.attrs["href"] not in pages:
new_page = link.attrs["href"]
print(new_page)
pages.add(new_page)
get_links(new_page)
get_links("")
Source:
https://gist.github.com/AO8/f721b6736c8a4805e99e377e72d3edbf
You can change the part:
for link in soup.find_all("a", href=pattern):
#do something
To check for a keyword I think
You are cooking a soup so first at all taste it and check if everything you expect contains in it.
ResultSet of your selection is empty cause structure in response differs a bit from your expected one from the developer tools.
To get the list of links select more specific:
links = [a.get('href') for a in soup.select('a.icon')]
Output:
['https://apps.apple.com/kr/app/youtube/id544007664', 'https://apps.apple.com/kr/app/%EC%BF%A0%ED%8C%A1%ED%94%8C%EB%A0%88%EC%9D%B4/id1536885649', 'https://apps.apple.com/kr/app/youtube-music/id1017492454', 'https://apps.apple.com/kr/app/instagram/id389801252', 'https://apps.apple.com/kr/app/youtube-kids/id936971630', 'https://apps.apple.com/kr/app/youtube-studio/id888530356', 'https://apps.apple.com/kr/app/google-chrome/id535886823', 'https://apps.apple.com/kr/app/tiktok-%ED%8B%B1%ED%86%A1/id1235601864', 'https://apps.apple.com/kr/app/google/id284815942']

How to get specific text hyperlinks in the home webpage by BeautifulSoup?

I want to search all hyperlink that its text name includes "article" in https://www.geeksforgeeks.org/
for example, on the bottom of this webpage
Write an Article
Improve an Article
I want to get them all hyperlink and print them, so I tried to,
from urllib.request import urlopen
from bs4 import BeautifulSoup
import os
import re
url = 'https://www.geeksforgeeks.org/'
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, "html.parser")
links = []
for link in soup.findAll('a',href = True):
#print(link.get("href")
if re.search('/article$', href):
links.append(link.get("href"))
However, it get a [] in result, how to solve it?
Here is something you can try:
Note that there are more links with the test article in the link you provided, but it gives the idea how you can deal with this.
In this case I just checked if the word article is in the text of that tag. You can use regex search there, but for this example it is an overkill.
import requests
from bs4 import BeautifulSoup
url = 'https://www.geeksforgeeks.org/'
res = requests.get(url)
if res.status_code != 200:
'no resquest'
soup = BeautifulSoup(res.content, "html.parser")
links_with_article = soup.findAll(lambda tag:tag.name=="a" and "article" in tag.text.lower())
EDIT:
If you know that there is a word in the href, i.e. in the link itself:
soup.select("a[href*=article]")
this will search for the word article in the href of all elements a.
Edit: get only href:
hrefs = [link.get('href') for link in links_with_article]

Exporting data from HTML to Excel

i just started programming.
I have the task to extract data from a HTML page to Excel.
Using Python 3.7.
My Problem is, that i have a website, whith more urls inside.
Behind these urls again more urls.
I need the data behind the third url.
My first Problem would be, how i can dictate the programm to choose only specific links from an ul rather then every ul on the page?
from bs4 import BeautifulSoup
import urllib
import requests
import re
page = urllib.request.urlopen("file").read()
soup = BeautifulSoup(page, "html.parser")
print(soup.prettify())
for link in soup.find_all("a", href=re.compile("katalog_")):
links= link.get("href")
if "katalog" in links:
for link in soup.find_all("a", href=re.compile("alle_")):
links = link.get("href")
print(soup.get_text())
There are many ways, one is to use "find_all" and try to be specific on the tags like "a" just like you did. If that's the only option, then use regular expression with your output. You can refer to this thread: Python BeautifulSoup Extract specific URLs. Also please show us either the link, or html structure of the links you want to extract. We would like to see the differences between the URLs.
PS: Sorry I can't make comments because of <50 reputation or I would have.
Updated answer based on understanding:
from bs4 import BeautifulSoup
import urllib
import requests
page = urllib.request.urlopen("https://www.bsi.bund.de/DE/Themen/ITGrundschutz/ITGrundschutzKompendium/itgrundschutzKompendium_node.html").read()
soup = BeautifulSoup(page, "html.parser")
for firstlink in soup.find_all("a",{"class":"RichTextIntLink NavNode"}):
firstlinks = firstlink.get("href")
if "bausteine" in firstlinks:
bausteinelinks = "https://www.bsi.bund.de/" + str(firstlinks.split(';')[0])
response = urllib.request.urlopen(bausteinelinks).read()
soup = BeautifulSoup(response, 'html.parser')
secondlink = "https://www.bsi.bund.de/" + str(((soup.find("a",{"class":"RichTextIntLink Basepage"})["href"]).split(';'))[0])
res = urllib.request.urlopen(secondlink).read()
soup = BeautifulSoup(res, 'html.parser')
listoftext = soup.find_all("div",{"id":"content"})
for text in listoftext:
print (text.text)

Beautifulsoup extracting urls from a given website menu

Hello every one I'm new to beautifulsoup, I'm trying to write a function that will be able to extract second level urls from a given website.
For example if I have this website url : https://edition.cnn.com/ my function should be able to return
https://edition.cnn.com/world
https://edition.cnn.com/politics
https://edition.cnn.com/business
https://edition.cnn.com/health
https://edition.cnn.com/entertainment
https://edition.cnn.com/style
https://edition.cnn.com/travel
first I have tried this code to retrieve all links starting with the string of the url:
from bs4 import BeautifulSoup as bs4
import requests
import lxml
import re
def getLinks(url):
response = requests.get(url)
data = response.text
soup = bs4(data, 'lxml')
links = []
for link in soup.find_all('a', href=re.compile(str(url))):
links.append(link.get('href'))
return links
But then again the actual output is giving me all the links even links of articles which is not I'm looking for. is there a method that I can use to get what I want using regular expression or others.
The links are inside <nav> tag, so using CSS selector nav a[href] will select only links inside <nav> tag:
import requests
from bs4 import BeautifulSoup
url = 'https://edition.cnn.com'
soup = BeautifulSoup(requests.get(url).text, 'lxml')
for a in soup.select('nav a[href]'):
if a['href'].count('/') > 1 or '#' in a['href']:
continue
print(url + a['href'])
Prints:
https://edition.cnn.com/world
https://edition.cnn.com/politics
https://edition.cnn.com/business
https://edition.cnn.com/health
https://edition.cnn.com/entertainment
https://edition.cnn.com/style
https://edition.cnn.com/travel
https://edition.cnn.com/sport
https://edition.cnn.com/videos
https://edition.cnn.com/world
https://edition.cnn.com/africa
https://edition.cnn.com/americas
https://edition.cnn.com/asia
https://edition.cnn.com/australia
https://edition.cnn.com/china
https://edition.cnn.com/europe
https://edition.cnn.com/india
https://edition.cnn.com/middle-east
https://edition.cnn.com/uk
...and so on.

BeautifulSoup, findAll after findAll?

I'm pretty new to Python and mainly need it for getting information from websites.
Here I tried to get the short headlines from the bottom of the website, but cant quite get them.
from bfs4 import BeautifulSoup
import requests
url = "http://some-website"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
nachrichten = soup.findAll('ul', {'class':'list'})
Now I would need another findAll to get all the links/a from the var "nachrichten", but how can I do this ?
Use a css selector with select if you want all the links in a single list:
anchors = soup.select('ul.list a')
If you want individual lists:
anchors = [ ul.find_all(a) for a in soup.find_all('ul', {'class':'list'})]
Also if you want the hrefs you can make sure you only find the anchors with href attributes and extract:
hrefs = [a["href"] for a in soup.select('ul.list a[href]')]
With find_all set href=True i.e ul.find_all(a, href=True) .
from bs4 import BeautifulSoup
import requests
url = "http://www.n-tv.de/ticker/"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
nachrichten = soup.findAll('ul', {'class':'list'})
links = []
for ul in nachrichten:
links.extend(ul.findAll('a'))
print len(links)
Hope this solves your problem and I think the import is bs4. I never herd of bfs4

Categories