How do I remove whitespace from bs4 output - Python

I'm writing a scraper with bs4 for this userscripts website, but I've run into an issue where I can't remove whitespace from the output. Nothing I've tried works. Can someone help me?
from bs4 import BeautifulSoup
import requests
import os
url = "https://openuserjs.org"
source = requests.get(url)
soup = BeautifulSoup(source.text,'lxml')
os.system('cls')
for Titles in soup.findAll("a", {"class": "tr-link-a"}):
    print(Titles.text.replace("Microsoft is aquiring GitHub", "").replace("TOS Changes", "").replace("Google Authentication Deprecation 2.0", "").replace("Server Maintenance", "").replace("rawgit.com Deprecation and EOL", ""))

To get the titles without the announcements, try the CSS selector below.
for Titles in soup.select("a.tr-link-a>b"):
    print(Titles.text.strip())
Output:
TopAndDownButtonsEverywhere
Anti-Adblock Killer | Reek
YouTube Center
EasyVideoDownload
AdsBypasser
Endless Google
YouTube +
Shadow Selection
bongacamsKillAds
Google View Image
Youtube - Restore Classic
Webcomic Reader
Shiki Rating
Warez-BB +
cinemapress
Google Hit Hider by Domain (Search Filter / Block Sites)
Chaturbate Clean
google cache comeback
translate.google tooltip
Amazon Smile Redirect
oujs - JsBeautify
IMDb 'My Movies' enhancer
EX-百度云盘
Wide Github
DuckDuckGo Extended
If you want to use findAll() instead, try this.
for Titles in soup.findAll("a", {"class": "tr-link-a"}):
    if Titles.find('b'):
        print(Titles.find('b').text.strip())
Code:
from bs4 import BeautifulSoup
import requests
import os
url = "https://openuserjs.org"
source = requests.get(url)
soup = BeautifulSoup(source.text,'lxml')
for Titles in soup.findAll("a", {"class": "tr-link-a"}):
    if Titles.find('b'):
        print(Titles.find('b').text.strip())
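Both approaches can be checked offline against a small HTML sample. This is only a sketch: the markup below is a hypothetical stand-in that assumes announcement links have no <b> child while script links do. It is not copied from openuserjs.org.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: assumed to mimic the listing, not taken from the site.
SAMPLE_HTML = """
<table>
  <tr><td><a class="tr-link-a" href="/announcements/1">  TOS Changes  </a></td></tr>
  <tr><td><a class="tr-link-a" href="/scripts/1"><b>
      YouTube Center
  </b></a></td></tr>
  <tr><td><a class="tr-link-a" href="/scripts/2"><b> AdsBypasser </b></a></td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# CSS child combinator: only <b> elements directly inside a.tr-link-a.
titles_select = [b.get_text(strip=True) for b in soup.select("a.tr-link-a > b")]

# Equivalent find_all() loop: skip links that have no <b> child.
titles_find = []
for link in soup.find_all("a", class_="tr-link-a"):
    bold = link.find("b")
    if bold is not None:
        titles_find.append(bold.get_text(strip=True))

print(titles_select)
print(titles_find)
```

Here get_text(strip=True) does the same job as .text.strip() for a tag with a single text node, and filtering on the <b> child avoids the chain of replace() calls.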

Related

Web scraping youtube page

I'm trying to get the title of a YouTube video given its link, but I'm unable to access the element that holds the title. I'm using bs4 to parse the HTML.
I noticed I'm unable to access any element inside the 'ytd-app' tag on the YouTube page.
import bs4
import requests
listed_url = "https://www.youtube.com/watch?v=9IfT8KXX_9c&list=PLQVvvaa0QuDfKTOs3Keq_kaG2P55YRn5v&index=31"
listed = requests.get(listed_url)
soup = bs4.BeautifulSoup(listed.text, "html.parser")
a = soup.find_all(attrs={"class": "style-scope ytd-video-primary-info-renderer"})
print(a)
So how can I get the video title? Is there something I'm doing wrong, or did YouTube intentionally create a tag like this to prevent web scraping?
The class you are using is rendered through JavaScript and its contents are dynamic, so it is very difficult to find that data using bs4.
What you can do is inspect the soup manually and look for a tag that is present in the downloaded HTML, such as the title tag.
You can also try pytube.
import bs4
import requests
listed_url = "https://www.youtube.com/watch?v=9IfT8KXX_9c&list=PLQVvvaa0QuDfKTOs3Keq_kaG2P55YRn5v&index=31"
listed = requests.get(listed_url)
soup = bs4.BeautifulSoup(listed.text, "html.parser")
print(soup.find("title").get_text())
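The difference between static and JavaScript-rendered content can be illustrated without a network request. The sketch below uses hypothetical HTML and assumes the page exposes an og:title meta tag. That tag is an assumption for illustration, not something confirmed for any particular YouTube page.

```python
from bs4 import BeautifulSoup

# Hypothetical static HTML, assumed for illustration; a real page may differ.
SAMPLE_HTML = """
<html><head>
  <title>Example video title - YouTube</title>
  <meta property="og:title" content="Example video title">
</head><body><ytd-app></ytd-app></body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The <title> tag is part of the static markup, so bs4 can read it.
page_title = soup.find("title").get_text(strip=True)

# A metadata tag, if the page provides one, avoids the site-name suffix.
meta = soup.find("meta", attrs={"property": "og:title"})
video_title = meta["content"] if meta is not None else page_title

# Elements that JavaScript would fill in are absent from the downloaded HTML.
rendered = soup.find_all(class_="ytd-video-primary-info-renderer")

print(page_title)
print(video_title)
print(rendered)
```

The empty list for the renderer class mirrors what the question observed: requests only downloads the initial HTML, and nothing inside ytd-app exists until a browser runs the page's scripts.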

Getting a string in a specific tag with Beautiful Soup

I am trying to get all the titles from the website https://webscraper.io/test-sites using Beautiful Soup. Each title (in this case, E-commerce site) is always contained in markup like the following:
<h2 class="site-heading">
<a href="/test-sites/e-commerce/allinone">
E-commerce site
</a>
</h2>
I can't extract that part. I have already tried different things; for example, the code that seems most intuitive to me does not work:
import re
from bs4 import BeautifulSoup
import requests
url = 'https://webscraper.io/test-sites'
r = requests.get(url)
r_html = r.text
soup = BeautifulSoup(r_html)
string = soup.find_all("h2", string=re.compile("E-commerce"))
How can I get just the titles (in this case 'E-commerce site') as a list?
You are close. A few issues.
You are not specifying a parser when parsing r_html. I have used html.parser here.
I don't see any need to use regular expressions (re) for your problem.
The titles are inside h2 tags with the class name site-heading. You can select them by that class.
This code selects all the titles and prints them.
from bs4 import BeautifulSoup
import requests
url = 'https://webscraper.io/test-sites'
r = requests.get(url)
r_html = r.text
soup = BeautifulSoup(r_html,"html.parser")
string = soup.find_all("h2", class_='site-heading')
for i in string:
    print(i.text.strip())
Output:
E-commerce site
E-commerce site with pagination links
E-commerce site with popup links
E-commerce site with AJAX pagination links
E-commerce site with "Load more" buttons
E-commerce site that loads items while scrolling
Table playground
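As a possible explanation of why the original attempt returns nothing: when find_all() is given a tag name together with string=, it compares the pattern against the tag's .string, which is None whenever the tag has more than one child. The sketch below uses the markup from the question plus a second heading whose href is made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# First heading is from the question; the second one is a hypothetical addition.
SAMPLE_HTML = """
<h2 class="site-heading">
<a href="/test-sites/e-commerce/allinone">
E-commerce site
</a>
</h2>
<h2 class="site-heading">
<a href="/test-sites/tables">
Table playground
</a>
</h2>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
pattern = re.compile("E-commerce")

# Each <h2> has three children (newline, <a>, newline), so .string is None
# and the string filter has nothing to match against.
by_h2_string = soup.find_all("h2", string=pattern)

# The <a> holds a single text node, so the same filter works there.
by_a_string = [a.get_text(strip=True) for a in soup.find_all("a", string=pattern)]

# Selecting by class needs no regular expression at all.
titles = [h2.get_text(strip=True) for h2 in soup.find_all("h2", class_="site-heading")]

print(by_h2_string)
print(by_a_string)
print(titles)
```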
import re
import requests
from bs4 import BeautifulSoup
url = 'https://webscraper.io/test-sites'
r = requests.get(url)
r_html = r.text
soup = BeautifulSoup(r_html, features="html.parser")
h2s = soup.find_all("h2")
for h2 in h2s:
    print(h2.text.strip())
This will give you the text of every h2 on the page.
Let me know if this helps you.
If I understand you correctly, you want to get a list of all the titles available. You could do something like this:
titles = [x.get_text(strip=True) for x in soup.find_all("h2", class_="site-heading")]

Python BeautifulSoup trouble extracting titles from a page with JS

I'm having some serious issues trying to extract the titles from a webpage. I've done this before on some other sites, but this one seems to be a problem because of the JavaScript.
The test link is "https://www.thomasnet.com/products/adhesives-393009-1.html"
The first title I want extracted is "Toagosei America, Inc."
Here is my code:
import requests
from bs4 import BeautifulSoup
url = ("https://www.thomasnet.com/products/adhesives-393009-1.html")
r = requests.get(url).content
soup = BeautifulSoup(r, "html.parser")
print(soup.get_text())
If I run it like this, with get_text, I can find the titles in the result. However, as soon as I change it to find_all or find, the titles are lost. I can't find them using the web browser's inspect tool because it's all JS-generated.
Any advice would be greatly appreciated.
You have to specify what to find. In this case, searching for <h2> gets the first title:
import requests
from bs4 import BeautifulSoup
url = 'https://www.thomasnet.com/products/adhesives-393009-1.html'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
first_title = soup.find('h2')
print(first_title.text)
Prints:
Toagosei America, Inc.
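Note that find() returns None when nothing matches, so calling .text on the result can raise an AttributeError. A minimal sketch of a guarded lookup follows. The helper name first_heading and the sample HTML strings are hypothetical, introduced only for illustration.

```python
from bs4 import BeautifulSoup


def first_heading(html, tag="h2"):
    """Return the stripped text of the first `tag` element, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    element = soup.find(tag)  # find() yields None when there is no match
    return element.get_text(strip=True) if element is not None else None


# Hypothetical inputs: one with headings, one without.
with_heading = first_heading("<div><h2> Toagosei America, Inc. </h2><h2>Other</h2></div>")
without_heading = first_heading("<div><p>No headings here</p></div>")

print(with_heading)
print(without_heading)
```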

Scraping pdfs from a webpage

I would like to download all financial reports for a given company from the Danish company register (CVR register). An example could be Chr. Hansen Holding in the link below:
https://datacvr.virk.dk/data/visenhed?enhedstype=virksomhed&id=28318677&soeg=chr%20hansen&type=undefined&language=da
Specifically, I would like to download all the PDFs under the tab "Regnskaber" (= financial reports). I have no previous experience with web scraping in Python. I tried using BeautifulSoup, but given my lack of experience, I cannot find the correct way to search the response.
Below is what I tried, but nothing is printed (i.e. it did not find any PDFs).
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
web_page = "https://datacvr.virk.dk/data/visenhed?enhedstype=virksomhed&id=28318677&soeg=chr%20hansen&type=undefined&language=da"
response = requests.get(web_page)
soup = BeautifulSoup(response.text)
soup.findAll('accordion-toggle')
for link in soup.select("a[href$='.pdf']"):
    print(link['href'].split('/')[-1])
All help and guidance will be much appreciated.
You should use select instead of findAll:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
web_page = "https://datacvr.virk.dk/data/visenhed?enhedstype=virksomhed&id=28318677&soeg=chr%20hansen&type=undefined&language=da"
response = requests.get(web_page)
soup = BeautifulSoup(response.text, 'lxml')
# Links marked as PDFs inside the "Regnskaber" accordion section
pdfs = soup.select('div[id="accordion-Regnskaber-og-nogletal"] a[data-type="PDF"]')
for link in pdfs:
    print(link['href'].split('/')[-1])
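The urljoin import in the snippets above is never used. One plausible use is turning relative hrefs into absolute URLs before downloading. The sketch below assumes a hypothetical base URL and sample markup, and selects links by their .pdf suffix rather than by the register's actual attributes.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical base URL and markup, assumed purely for illustration.
BASE_URL = "https://example.com/company/reports/"
SAMPLE_HTML = """
<div id="reports">
  <a href="/files/annual-2019.pdf">Annual report 2019</a>
  <a href="annual-2020.pdf">Annual report 2020</a>
  <a href="/about.html">About</a>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# [href$='.pdf'] keeps only anchors whose href ends with ".pdf".
pdf_urls = [urljoin(BASE_URL, a["href"]) for a in soup.select("a[href$='.pdf']")]

# The last path segment serves as a local file name.
file_names = [url.split("/")[-1] for url in pdf_urls]

print(pdf_urls)
print(file_names)
```

Each absolute URL could then be passed to requests.get() and the response content written to a file under the matching name.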

Scraping a webpage for link titles and URLs utilizing BeautifulSoup

I have a webpage of popular articles which I want to scrape for each quoted webpage's hyperlink and the title of the article it's displaying.
The desired output of my script is a CSV file which lists each title and the article content in one line. So if there are 50 articles on this webpage, I want one file with 50 lines and 100 data points.
My problem here is that the article titles and their hyperlinks are contained in an SVG container, which is throwing me off. I've utilized BeautifulSoup for web scraping before but am not sure how to select each article's title and hyperlink. Any and all help is much appreciated.
import requests
from bs4 import BeautifulSoup
import re
res = requests.get('http://fundersandfounders.com/what-internet-thinks-based-on-media/')
res.raise_for_status()
playFile = open('top_articles.html', 'wb')
for chunk in res.iter_content(100000):
    playFile.write(chunk)
f = open('top_articles.html')
soup = BeautifulSoup(f, 'html.parser')
links = soup.select('p') #i know this is where i'm messing up, but i'm not sure which selector to actually utilize so I'm using the paragraph selector as a place-holder
print(links)
I am aware that this is in effect a two step project: the current version of my script doesn't iterate through the list of all the hyperlinks whose actual content I'm going to be scraping. That's a second step which I can execute easily on my own, however if anyone would like to write that bit too, kudos to you.
You should do it in two steps:
parse the HTML and extract the link to the svg
download svg page, parse it with BeautifulSoup and extract the "bubbles"
Implementation:
from urllib.parse import urljoin # Python3
import requests
from bs4 import BeautifulSoup
base_url = 'http://fundersandfounders.com/what-internet-thinks-based-on-media/'
with requests.Session() as session:
    # extract the link to svg
    res = session.get(base_url)
    soup = BeautifulSoup(res.content, 'html.parser')
    svg = soup.select_one("object.svg-content")
    svg_link = urljoin(base_url, svg["data"])
    # download and parse svg
    res = session.get(svg_link)
    soup = BeautifulSoup(res.content, 'html.parser')
    for article in soup.select("#bubbles .bgroup"):
        title, resource = [item.get_text(strip=True, separator=" ") for item in article.select("a text")]
        print("Title: '%s'; Resource: '%s'." % (title, resource))
Prints article titles and resources:
Title: 'CNET'; Resource: 'Android Apps That Extend Battery Life'.
Title: '5-Years-Old Shoots Sister'; Resource: 'CNN'.
Title: 'Samsung Galaxy Note II'; Resource: 'Engaget'.
...
Title: 'Predicting If a Couple Stays Together'; Resource: 'The Atlantic Magazine'.
Title: 'Why Doctors Die Differently'; Resource: 'The Wall Street Journal'.
Title: 'The Ideal Nap Length'; Resource: 'Lifehacker'.
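The question asks for CSV output, which the answer's print loop does not produce. A minimal sketch of that last step, using only the standard library, might look like this. The two rows are taken from the printed output above, while the helper name rows_to_csv and the column headers are assumptions.

```python
import csv
import io

# Sample (title, resource) pairs copied from the printed output above.
rows = [
    ("The Ideal Nap Length", "Lifehacker"),
    ("Why Doctors Die Differently", "The Wall Street Journal"),
]


def rows_to_csv(pairs):
    """Serialize (title, resource) pairs as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["title", "resource"])  # assumed column names
    writer.writerows(pairs)
    return buffer.getvalue()


csv_text = rows_to_csv(rows)
print(csv_text)
```

To write a real file instead, the same writer calls work on a handle from open('top_articles.csv', 'w', newline=''), where the file name is again only an example.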
