BeautifulSoup class searching, no results - python

I'm using BeautifulSoup to parse the HTML of this site and extract the URLs of the results. But when I call find_all I get an empty list as output. I manually checked the HTML that I download from the site, and it does contain the class I'm searching for.
If somebody could point out where I'm making a mistake, or show a better solution, I would be grateful!
from bs4 import BeautifulSoup
import requests
page = requests.get("https://www.awf.edu.pl/pracownik/wyszukiwarka-pracownikow?result_5251_result_page=3&queries_search_query=&category_kategorie=wydzia_wychowania_fizycznego&search_page_5251_submit_button=Szukaj&current_result_page=1&results_per_page=20&submitted_search_category=&mode=results")
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find_all('div', class_ = 'search-item photo')
I've also tried the code below to find all links on the page and then filter out the ones I need, but in that case I only get the parent tag. If an 'a' tag is nested inside another 'a' tag, the inner one is skipped, although from the documentation I expected it to be included in the output as well.
from bs4 import BeautifulSoup
import requests
page = requests.get("https://www.awf.edu.pl/pracownik/wyszukiwarka-pracownikow?result_5251_result_page=3&queries_search_query=&category_kategorie=wydzia_wychowania_fizycznego&search_page_5251_submit_button=Szukaj&current_result_page=1&results_per_page=20&submitted_search_category=&mode=results")
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find_all('a')
BeautifulSoup can't find class that exists on webpage?
I found this answer to a similar question, but in my case, I can see the HTML code that I want to find in my console when I use print(soup.prettify())

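The problem you are facing comes from the parser used on page.content. html.parser and lxml can build different trees from the same imperfect markup, so a tag that is missing with one parser may still be found with the other. Note that lxml has to be installed separately (pip install lxml).
Replace:
soup = BeautifulSoup(page.content, 'html.parser')
with:
soup = BeautifulSoup(page.content, 'lxml')
Hope this helps.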
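To check whether the parser really is the cause, it can help to run the same search with every parser that happens to be installed and compare the counts. The following is only a minimal sketch: SAMPLE_HTML and count_matches are hypothetical names invented for illustration, and the markup is a stand-in for the downloaded page, not the real site.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical stand-in for page.content (not the real site markup)
SAMPLE_HTML = """
<div class="search-item photo"><a href="/pracownik/1">Person 1</a></div>
<div class="search-item"><a href="/pracownik/2">Person 2</a></div>
"""


def count_matches(html, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser_name: number of matching divs} for each installed parser."""
    counts = {}
    for parser in parsers:
        try:
            soup = BeautifulSoup(html, parser)
        except FeatureNotFound:
            # This parser library is not installed; skip it.
            continue
        counts[parser] = len(soup.find_all("div", class_="search-item photo"))
    return counts


print(count_matches(SAMPLE_HTML))
```

If the counts differ between parsers on the real page, the markup is being repaired differently, and switching parsers is a reasonable fix.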

Related

Web scraping IMDB with Python's Beautiful Soup

I am trying to parse this page "https://www.imdb.com/title/tt0068112/?ref_=fn_al_tt_1", but I can't find the href that I need (href="/title/tt0068112/episodes?ref_=tt_eps_sm").
I tried with this code:
import requests
from bs4 import BeautifulSoup

url = "https://www.imdb.com/title/tt0068112/?ref_=fn_al_tt_1"
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")
# Print the href of every anchor on the page
for a in soup.find_all('a'):
    print(a['href'])
What's wrong with this? I also tried to check "manually" with print(soup.prettify()) but it seems that that link is hidden or something like that.
You can get the page HTML with requests; the href is in there, so there is no need for special APIs. I tried this and it worked:
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.imdb.com/title/tt0068112/?ref_=fn_al_tt_1")
soup = BeautifulSoup(page.content, "html.parser")
scooby_link = ""
# Match the anchor by its exact href value
for item in soup.find_all("a", href="/title/tt0068112/episodes?ref_=tt_eps_sm"):
    print(item["href"])
    # Prefix the site root to get an absolute URL
    scooby_link = "https://www.imdb.com" + item["href"]
print(scooby_link)
I'm assuming you also wanted to save the link to a variable for further scraping so I did that as well. 🙂
To get the Episodes link, you can use the following example:
import requests
from bs4 import BeautifulSoup
url = "https://www.imdb.com/title/tt0068112/?ref_=fn_al_tt_1"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
print(soup.select_one("a:-soup-contains(Episodes)")["href"])
Prints:
/title/tt0068112/episodes?ref_=tt_eps_sm
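Since the printed href is relative, it has to be joined with the page URL before it can be requested. Here is a small sketch using urljoin from the standard library. SAMPLE_HTML and absolute_links are made-up names for illustration, and the sample markup only imitates the relevant part of the page.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

PAGE_URL = "https://www.imdb.com/title/tt0068112/?ref_=fn_al_tt_1"

# Hypothetical stand-in for the downloaded page
SAMPLE_HTML = """
<a name="top">Top</a>
<a href="/title/tt0068112/episodes?ref_=tt_eps_sm">Episodes</a>
"""


def absolute_links(html, base_url):
    """Return every href on the page as an absolute URL."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True skips anchors without an href attribute,
    # which would otherwise raise KeyError on a["href"].
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]


print(absolute_links(SAMPLE_HTML, PAGE_URL))
```

Filtering with href=True also avoids the KeyError that a["href"] raises when a page contains anchors without an href.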

Can't figure out why soup.find_all() returns an empty list

I'm a novice in Python and am practicing web scraping with BeautifulSoup.
I've checked some similar questions such as this one, this one, and this one. However, I'm still stuck on my problem.
Here is my code:
import urllib.request
from bs4 import BeautifulSoup
html = urllib.request.urlopen("https://en.wikipedia.org/wiki/List_of_largest_recorded_music_markets").read()
soup = BeautifulSoup(html, 'html.parser')
tbody = soup.find_all('table',{"class":"wikitable plainrowheaders sortable jquery-tablesorter"})
First, I don't think the web page I'm looking at relies on the JavaScript that was mentioned in similar questions. I intend to extract the data in those tables, but when I executed print(tbody), I found it was an empty list. Could someone have a look and give me some hints?
Thank you.
You must remove the jquery-tablesorter part. That class is added by JavaScript after the page loads, so it is not in the HTML that urllib downloads, and a search that includes it matches nothing.
This should work:
import urllib.request
from bs4 import BeautifulSoup
html = urllib.request.urlopen("https://en.wikipedia.org/wiki/List_of_largest_recorded_music_markets").read()
soup = BeautifulSoup(html, 'html.parser')
tbody = soup.find('table', {"class": "wikitable plainrowheaders sortable"})
print(tbody)
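An alternative that is less sensitive to the exact class string is a CSS selector, which matches on individual classes regardless of order or extra classes. The sketch below runs on a tiny made-up table (SAMPLE_HTML is a stand-in, not the real Wikipedia markup) and also shows the rows being pulled out as lists of cell text.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page: no jquery-tablesorter class,
# because that class only exists after JavaScript has run in a browser.
SAMPLE_HTML = """
<table class="wikitable plainrowheaders sortable">
  <tr><th>Rank</th><th>Market</th></tr>
  <tr><td>1</td><td>Example A</td></tr>
  <tr><td>2</td><td>Example B</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The original search string includes a class the raw HTML does not have.
with_js_class = soup.find_all(
    "table", {"class": "wikitable plainrowheaders sortable jquery-tablesorter"}
)

# A CSS selector only requires the listed classes to be present.
tables = soup.select("table.wikitable.sortable")

# Turn the first matching table into a list of rows of cell text.
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in tables[0].find_all("tr")
]

print(len(with_js_class), len(tables))
print(rows)
```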

Exporting data from HTML to Excel

I just started programming.
I have the task of extracting data from an HTML page to Excel.
I'm using Python 3.7.
My problem is that I have a website with more URLs inside.
Behind these URLs there are again more URLs.
I need the data behind the third URL.
My first problem is: how can I tell the program to choose only specific links from one ul rather than from every ul on the page?
from bs4 import BeautifulSoup
import urllib.request
import re

page = urllib.request.urlopen("file").read()
soup = BeautifulSoup(page, "html.parser")
print(soup.prettify())
# Links whose href contains "katalog_"
for link in soup.find_all("a", href=re.compile("katalog_")):
    links = link.get("href")
    if "katalog" in links:
        # Links whose href contains "alle_"
        for link in soup.find_all("a", href=re.compile("alle_")):
            links = link.get("href")
            print(soup.get_text())
There are many ways. One is to use find_all and be specific about the tags, like "a", just as you did. If that's the only option, then filter the output with a regular expression. You can refer to this thread: Python BeautifulSoup Extract specific URLs. Also, please show us either the link or the HTML structure of the links you want to extract. We would like to see the differences between the URLs.
PS: Sorry I can't make comments because of <50 reputation or I would have.
Updated answer based on understanding:
from bs4 import BeautifulSoup
import urllib.request

page = urllib.request.urlopen("https://www.bsi.bund.de/DE/Themen/ITGrundschutz/ITGrundschutzKompendium/itgrundschutzKompendium_node.html").read()
soup = BeautifulSoup(page, "html.parser")
# First level: navigation links on the start page
for firstlink in soup.find_all("a", {"class": "RichTextIntLink NavNode"}):
    firstlinks = firstlink.get("href")
    if "bausteine" in firstlinks:
        # Drop everything after ';' in the href and make the URL absolute
        bausteinelinks = "https://www.bsi.bund.de/" + str(firstlinks.split(';')[0])
        response = urllib.request.urlopen(bausteinelinks).read()
        soup = BeautifulSoup(response, 'html.parser')
        # Second level: first "Basepage" link on that page
        secondlink = "https://www.bsi.bund.de/" + str(soup.find("a", {"class": "RichTextIntLink Basepage"})["href"].split(';')[0])
        res = urllib.request.urlopen(secondlink).read()
        soup = BeautifulSoup(res, 'html.parser')
        # Third level: print the text of the content container
        listoftext = soup.find_all("div", {"id": "content"})
        for text in listoftext:
            print(text.text)
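For the original question of picking links from one particular ul rather than from every ul on the page, one option is to locate that list first and search only inside it. This is a sketch under assumptions: the id "catalog", the sample markup, and the helper name links_in_list are all invented for illustration, since the real page structure was not shown.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup; the real page structure is not known.
SAMPLE_HTML = """
<ul id="nav"><li><a href="/home.html">Home</a></li></ul>
<ul id="catalog">
  <li><a href="/katalog_a.html">A</a></li>
  <li><a href="/katalog_b.html">B</a></li>
  <li><a href="/other.html">Other</a></li>
</ul>
"""


def links_in_list(html, list_id, pattern):
    """Return hrefs matching `pattern` from the single <ul> with the given id."""
    soup = BeautifulSoup(html, "html.parser")
    target = soup.find("ul", id=list_id)
    if target is None:
        # No such list on the page
        return []
    # Searching from `target` ignores links in every other list.
    return [a["href"] for a in target.find_all("a", href=re.compile(pattern))]


print(links_in_list(SAMPLE_HTML, "catalog", r"katalog_"))
```

If the list has no id, the same idea works with a class or with a CSS selector passed to soup.select.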

How do I extract just the blog content and exclude other elements using Beautiful Soup

I am trying to get the blog content from this blog post and by content, I just mean the first six paragraphs. This is what I've come up with so far:
soup = BeautifulSoup(url, 'lxml')
body = soup.find('div', class_='post-body')
Printing body will also include other stuff under the main div tag.
Try this:
import requests
from bs4 import BeautifulSoup

res = requests.get("http://www.fashionpulis.com/2017/08/being-proud-too-soon.html").text
soup = BeautifulSoup(res, 'html.parser')
# Select the post container by its exact id
for item in soup.select("div#post-body-604825342214355274"):
    print(item.text.strip())
Use this:
import requests
from bs4 import BeautifulSoup

res = requests.get("http://www.fashionpulis.com/2017/08/acceptance-is-must.html").text
soup = BeautifulSoup(res, 'html.parser')
# Match any div whose id starts with "post-body-"
for item in soup.select("div[id^='post-body-']"):
    print(item.text)
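To keep only the first six paragraphs rather than everything under the post container, the search can be limited inside that container. The sketch below assumes the post text sits in p elements, which may not be true for the real blog. SAMPLE_HTML and first_paragraphs are hypothetical names used for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical post markup: eight paragraphs plus an unrelated widget.
paragraphs = "".join(f"<p>Paragraph {i}</p>" for i in range(1, 9))
SAMPLE_HTML = (
    f'<div id="post-body-123" class="post-body">{paragraphs}'
    '<div class="share">Share</div></div>'
)


def first_paragraphs(html, count=6):
    """Return the text of the first `count` <p> elements of the post body."""
    soup = BeautifulSoup(html, "html.parser")
    body = soup.select_one("div[id^='post-body-']")
    if body is None:
        return []
    # limit stops the search after `count` matches.
    return [p.get_text(strip=True) for p in body.find_all("p", limit=count)]


print(first_paragraphs(SAMPLE_HTML))
```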
I found this solution very interesting: Scrape multiple pages with BeautifulSoup and Python
However, I haven't found any query string parameters to work with here, so maybe you can build something from this approach instead.
The most obvious thing to do right now is something like this:
1. Scrape through every month and year and get all titles from the Blog Archive part of the pages (e.g. on http://www.fashionpulis.com/2017/03/ and so on).
2. Build the URLs using the titles and the corresponding months/years (the URL is always http://www.fashionpulis.com/$YEAR/$MONTH/$TITLE.html).
3. Scrape the text as described by Shahin in a previous answer.
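The URL-building step can be sketched as two small helpers. The function names are hypothetical, and the sketch assumes the title is already in the slug form used in the URL (lowercase words joined by hyphens), which the title shown in the Blog Archive may not be.

```python
BASE_URL = "http://www.fashionpulis.com"


def archive_url(year, month):
    """URL of the monthly archive page, e.g. .../2017/03/."""
    return f"{BASE_URL}/{year}/{month:02d}/"


def post_url(year, month, title_slug):
    """URL of a single post; title_slug is assumed to be URL-ready."""
    return f"{BASE_URL}/{year}/{month:02d}/{title_slug}.html"


print(archive_url(2017, 3))
print(post_url(2017, 8, "being-proud-too-soon"))
```

In practice it is probably more reliable to read the href of each title link on the archive page than to rebuild the URL from the title text.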

Beautifulsoup unable to extract data using attrs=class

I am extracting data for a research project and I have successfully used findAll('div', attrs={'class':'someClassName'}) on many websites, but this particular website (WebSite Link) doesn't return any values when I use the attrs option. When I don't use the attrs option, I get the entire HTML DOM.
Here is the simple code that I started with to test it out:
soup = bs(urlopen(url))
for div in soup.findAll('div', attrs={'class':'data'}):
    print(div)
The code works fine for me when I fetch the page with requests:
import requests
from BeautifulSoup import BeautifulSoup as bs

# grab HTML
r = requests.get(r'http://www.amazon.com/s/ref=sr_pg_1?rh=n:172282,k%3adigital%20camera&keywords=digital%20camera&ie=UTF8&qid=1343600585')
html = r.text
# parse the HTML
soup = bs(html)
results = soup.findAll('div', attrs={'class': 'data'})
print(results)
If you or anyone reading this question would like to know why the code you've given (copied below) wasn't able to find the attrs value:
soup = bs(urlopen(url))
for div in soup.findAll('div', attrs={'class':'data'}):
    print(div)
The issue arises where you create the BeautifulSoup object with soup = bs(urlopen(url)): the value of urlopen(url) is a response object, not the DOM.
Any issues you encountered could probably have been resolved more easily by using bs(urlopen(url).read()) instead.
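Here is a small sketch of the read-then-parse pattern, written for Python 3 and bs4 rather than the older BeautifulSoup package used in the thread. To keep it runnable offline, an io.BytesIO object stands in for the response returned by urlopen (both expose .read()). The markup and variable names are made up for illustration.

```python
import io

from bs4 import BeautifulSoup

# Hypothetical stand-in for urlopen(url): a file-like object with .read()
fake_response = io.BytesIO(
    b'<html><body><div class="data">one</div>'
    b'<div class="other">two</div><div class="data">three</div></body></html>'
)

# Read the raw bytes first, then hand the markup to the parser.
html = fake_response.read()
soup = BeautifulSoup(html, "html.parser")

results = soup.find_all("div", attrs={"class": "data"})
print([div.get_text() for div in results])
```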
