I am very new to anything web-scraping related, and as I understand it, Requests and BeautifulSoup are the way to go for that.
I want to write a program which emails me only one paragraph of a given link every couple of hours (trying a new way to read blogs through the day).
Say this particular link 'https://fs.blog/mental-models/' has a paragraph each on different models.
from bs4 import BeautifulSoup
import re
import requests
url = 'https://fs.blog/mental-models/'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
Now soup has a wall of markup before the paragraph text begins: <p> this is what I want to read </p>
soup.title.string works perfectly fine, but I don't know how to move ahead from here. Any directions?
Thanks.
Loop over soup.find_all('p') to find all the p tags and then use .text to get their text:
Furthermore, do all that under a div with the class rte since you don't want the footer paragraphs.
from bs4 import BeautifulSoup
import requests
url = 'https://fs.blog/mental-models/'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
divTag = soup.find_all("div", {"class": "rte"})
for tag in divTag:
    pTags = tag.find_all('p')
    for p in pTags[:-2]:  # trim the last two irrelevant-looking lines
        print(p.text)
OUTPUT:
Mental models are how we understand the world. Not only do they shape what we think and how we understand but they shape the connections and opportunities that we see.
.
.
.
5. Mutually Assured Destruction
Somewhat paradoxically, the stronger two opponents become, the less likely they may be to destroy one another. This process of mutually assured destruction occurs not just in warfare, as with the development of global nuclear warheads, but also in business, as with the avoidance of destructive price wars between competitors. However, in a fat-tailed world, it is also possible that mutually assured destruction scenarios simply make destruction more severe in the event of a mistake (pushing destruction into the “tails” of the distribution).
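Since the question also asks about emailing one paragraph every couple of hours, a minimal smtplib sketch could look like this (the SMTP server, credentials, and addresses are placeholders, not real values):
import smtplib
from email.message import EmailMessage

def send_paragraph(paragraph):
    # Build a plain-text mail containing one paragraph
    msg = EmailMessage()
    msg["Subject"] = "Today's mental model"
    msg["From"] = "me@example.com"    # placeholder address
    msg["To"] = "me@example.com"      # placeholder address
    msg.set_content(paragraph)
    # Placeholder SMTP server and credentials
    with smtplib.SMTP_SSL("smtp.example.com", 465) as server:
        server.login("me@example.com", "app-password")
        server.send_message(msg)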
If you want the text of all the p tags, you can just loop over them using the find_all method:
from bs4 import BeautifulSoup
import re
import requests
url = 'https://fs.blog/mental-models/'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
# print(soup)  # uncomment to inspect the raw soup
data = soup.find_all('p')
for p in data:
    text = p.get_text()
    print(text)
EDIT:
Here is the code to collect them separately in a list. You can then loop over the result list to remove empty strings, unused characters like \n, etc.
from bs4 import BeautifulSoup
import re
import requests
url = 'https://fs.blog/mental-models/'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
data = soup.find_all('p')
result = []
for p in data:
    result.append(p.get_text())
print(result)
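For the cleanup mentioned in the edit, a simple list comprehension is enough (a sketch):
# Drop empty strings and strip stray whitespace/newlines
result = [text.strip() for text in result if text.strip()]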
Here is a solution for the scheduling part (assuming Clock here means Kivy's Clock; note that schedule_interval needs a callback, not the result of calling print, and it only fires inside a running Kivy app):
from bs4 import BeautifulSoup
import requests
from kivy.clock import Clock  # assumption: Clock refers to Kivy's Clock

url = 'https://fs.blog/mental-models/'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
data = soup.find_all('p')
result = []
for p in data:
    result.append(p.get_text())

# schedule_interval expects a callable that takes the elapsed time (dt)
Clock.schedule_interval(lambda dt: print(result), 60)
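If you are not inside a running Kivy app, a plain loop with time.sleep is simpler (a sketch; the couple-of-hours interval comes from the question):
import time

for paragraph in result:
    print(paragraph)         # or email it, e.g. with an smtplib helper as sketched earlier
    time.sleep(2 * 60 * 60)  # wait two hours between paragraphs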
I want to extract ads that contain one of two special Persian words, "توافق" or "توافقی" (roughly, "agreement"/"negotiable"), from a website. I am using BeautifulSoup and split the content in the soup to find the ads that have my special words, but my code does not work. Could you please help me?
Here is my simple code:
import requests
from bs4 import BeautifulSoup
r = requests.get("https://divar.ir/s/tehran")
soup = BeautifulSoup(r.text, "html.parser")
results = soup.find_all("div", attrs={"class": "kt-post-card__body"})
for content in results:
    words = content.split()
    if words == "توافقی" or words == "توافق":
        print(content)
Since توافقی appears in the div tags with the kt-post-card__description class, I will use that. Then you can get the ads by using the tag's properties like .previous_sibling or .parent, or whatever fits...
import requests
from bs4 import BeautifulSoup
r = requests.get("https://divar.ir/s/tehran")
soup = BeautifulSoup(r.text, "html.parser")
results = soup.find_all("div", attrs={"class": "kt-post-card__description"})
for content in results:
    text = content.text
    if "توافقی" in text or "توافق" in text:
        print(content.previous_sibling)  # It's the h2 title.
Basically, you are trying to split a bs4 Tag object, and hence it's giving an error. Before splitting it, you need to convert it into a text string.
import re
from bs4 import BeautifulSoup
import requests
r = requests.get("https://divar.ir/s/tehran")
soup = BeautifulSoup(r.text, "html.parser")
results = soup.find_all("div", attrs={"class": "kt-post-card__description"})
for content in results:
words = content.text.split()
if "توافقی" in words or "توافق" in words:
print(content)
There are different issues. The first one, also mentioned by @Tim Roberts: you have to compare the list items with in:
if 'توافقی' in words or 'توافق' in words:
Second, you have to separate the texts of each of the child elements, so use get_text() with a separator:
words = content.get_text(' ', strip=True)
Note: requests does not render dynamic content; it just fetches the static HTML.
Example
import requests
from bs4 import BeautifulSoup
r = requests.get('https://divar.ir/s/tehran')
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('div', attrs={'class': 'kt-post-card__body'})
for content in results:
    words = content.get_text(' ', strip=True)
    if 'توافقی' in words or 'توافق' in words:
        print(content.text)
An alternative in this specific case could be the use of CSS selectors, so you could select the whole <article> and pick the elements you need (the :-soup-contains() pseudo-class comes from the soupsieve package used by modern BeautifulSoup installs):
results = soup.select('article:-soup-contains("توافقی"),article:-soup-contains("توافق")')
for item in results:
    print(item.h2)
    print(item.span)
from bs4 import BeautifulSoup
import requests
from urllib.request import urlopen
url = 'https://www.apple.com/kr/search/youtube?src=globalnav'
response = requests.get(url)
html = response.text
soup = BeautifulSoup(html, 'html.parser')
links = soup.select(".rf-serp-productname-list")
print(links)
I want to crawl through all the links of the apps shown. When I searched for a keyword, I thought links = soup.select(".rf-serp-productname-list") would work, but the links list is empty.
What should I do?
Just check this code; I think it is what you want:
import re
import requests
from bs4 import BeautifulSoup
pages = set()

def get_links(page_url):
    global pages
    pattern = re.compile("^(/)")
    html = requests.get(f"your_URL{page_url}").text  # f-strings require Python 3.6+
    soup = BeautifulSoup(html, "html.parser")
    for link in soup.find_all("a", href=pattern):
        if "href" in link.attrs:
            if link.attrs["href"] not in pages:
                new_page = link.attrs["href"]
                print(new_page)
                pages.add(new_page)
                get_links(new_page)

get_links("")
Source:
https://gist.github.com/AO8/f721b6736c8a4805e99e377e72d3edbf
You can change this part:
for link in soup.find_all("a", href=pattern):
    # do something
to check for a keyword, I think.
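For example, a keyword-filtering variant of the crawler could look like this (the base URL, keyword, and filtering rule are assumptions for illustration):
import re
import requests
from bs4 import BeautifulSoup

BASE = "https://www.apple.com"  # assumed base URL
pages = set()

def get_links_with_keyword(page_url, keyword):
    # Follow only internal links whose href contains the keyword
    pattern = re.compile("^(/)")
    html = requests.get(f"{BASE}{page_url}").text
    soup = BeautifulSoup(html, "html.parser")
    for link in soup.find_all("a", href=pattern):
        href = link.attrs.get("href", "")
        if keyword in href and href not in pages:
            pages.add(href)
            print(href)
            get_links_with_keyword(href, keyword)

get_links_with_keyword("/kr/search/youtube", "youtube")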
You are cooking a soup, so first of all taste it and check whether everything you expect is contained in it.
The ResultSet of your selection is empty because the structure in the response differs a bit from the one you expected based on the developer tools.
To get the list of links, select more specifically:
links = [a.get('href') for a in soup.select('a.icon')]
Output:
['https://apps.apple.com/kr/app/youtube/id544007664', 'https://apps.apple.com/kr/app/%EC%BF%A0%ED%8C%A1%ED%94%8C%EB%A0%88%EC%9D%B4/id1536885649', 'https://apps.apple.com/kr/app/youtube-music/id1017492454', 'https://apps.apple.com/kr/app/instagram/id389801252', 'https://apps.apple.com/kr/app/youtube-kids/id936971630', 'https://apps.apple.com/kr/app/youtube-studio/id888530356', 'https://apps.apple.com/kr/app/google-chrome/id535886823', 'https://apps.apple.com/kr/app/tiktok-%ED%8B%B1%ED%86%A1/id1235601864', 'https://apps.apple.com/kr/app/google/id284815942']
I am trying to web-scrape some article titles from a website. I do not want to include "Notes from the Editors" when I run my program, but for some reason this super simple, should-be-easy if statement on the last two lines isn't working, and it still prints out "Notes from the Editors". What's wrong?
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.cambridge.org/core/journals/american-political-science-review/issue/4061249B1054342207CEF9C50AEC68C5")
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.findAll('a', class_='part-link')
for result in results:
    if result.text != 'Notes from the Editors':
        print(result.text)
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.cambridge.org/core/journals/american-political-science-review/issue/4061249B1054342207CEF9C50AEC68C5")
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.findAll('a', class_='part-link')
for result in results:
    if result.text != '\nNotes from the Editors\n':
        print(result.text)
It's because your string doesn't exactly match the element's text: you are not considering the whitespace and newlines around it. Try the code above.
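A slightly more robust alternative (a sketch, not from the original answer) is to let BeautifulSoup strip the whitespace instead of hard-coding the newlines:
for result in results:
    if result.get_text(strip=True) != 'Notes from the Editors':
        print(result.get_text(strip=True))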
I just started programming.
I have the task of extracting data from an HTML page to Excel.
I'm using Python 3.7.
My problem is that I have a website with more URLs inside, and behind these URLs again more URLs. I need the data behind the third URL.
My first problem would be: how can I tell the program to choose only specific links from a ul rather than every ul on the page?
from bs4 import BeautifulSoup
import urllib.request
import re

page = urllib.request.urlopen("file").read()
soup = BeautifulSoup(page, "html.parser")
print(soup.prettify())
for link in soup.find_all("a", href=re.compile("katalog_")):
    links = link.get("href")
    if "katalog" in links:
        for link in soup.find_all("a", href=re.compile("alle_")):
            links = link.get("href")
            print(soup.get_text())
There are many ways; one is to use find_all and be specific with the tags, like "a", just as you did. If that's the only option, then use a regular expression on your output. You can refer to this thread: Python BeautifulSoup Extract specific URLs. Also, please show us either the link or the HTML structure of the links you want to extract; we would like to see the differences between the URLs.
PS: Sorry, I can't make comments because of <50 reputation, or I would have.
Updated answer based on understanding:
from bs4 import BeautifulSoup
import urllib.request

page = urllib.request.urlopen("https://www.bsi.bund.de/DE/Themen/ITGrundschutz/ITGrundschutzKompendium/itgrundschutzKompendium_node.html").read()
soup = BeautifulSoup(page, "html.parser")

# First level: navigation links that lead to the "Bausteine" pages
for firstlink in soup.find_all("a", {"class": "RichTextIntLink NavNode"}):
    firstlinks = firstlink.get("href")
    if "bausteine" in firstlinks:
        bausteinelinks = "https://www.bsi.bund.de/" + str(firstlinks.split(';')[0])
        response = urllib.request.urlopen(bausteinelinks).read()
        soup2 = BeautifulSoup(response, 'html.parser')
        # Second level: follow the "Basepage" link on each Baustein page
        secondlink = "https://www.bsi.bund.de/" + str(((soup2.find("a", {"class": "RichTextIntLink Basepage"})["href"]).split(';'))[0])
        res = urllib.request.urlopen(secondlink).read()
        soup3 = BeautifulSoup(res, 'html.parser')
        # Third level: the content div we actually want
        listoftext = soup3.find_all("div", {"id": "content"})
        for text in listoftext:
            print(text.text)
I am learning how to use BeautifulSoup. I managed to parse the HTML and now I want to extract a list of links from the page. The problem is that I am only interested in some of the links, and the only way I can think of is to take all the links after a certain word appears. Can I drop part of the soup before I start extracting? Thank you.
This is what I have:
# import libraries
import urllib2
from bs4 import BeautifulSoup
import pandas as pd
import os
import re
# specify the url
quote_page = 'https://econpapers.repec.org/RAS/pab7.htm'
# query the website and return the html to the variable page
page = urllib2.urlopen(quote_page)
# parse the html using beautiful soup and store in variable soup
soup = BeautifulSoup(page, 'html.parser')
print(soup)
#transform to pandas dataframe
pages1 = soup.find_all('li')
print(pages1)
pages2 = pd.DataFrame({
    "papers": pages1,
})
print(pages2)
And I need to drop the upper half of the links in pages2, and the only way to differentiate the ones I want from the rest is a word that appears in the HTML, namely this line: <h2 class="colored">Journal Articles</h2>
EDIT: I just noticed that I can also separate them by the beginning of the link. I only want the ones that start with "/article/".
Alternatively, using a CSS selector:
# parse the html using beautiful soup and store in variable soup
soup = BeautifulSoup(page, 'lxml')
#print(BeautifulSoup.prettify(soup))
css_selector = 'a[href^="/article"]'
href_tag_list = soup.select(css_selector)
print("Href list size:", len(href_tag_list)) # check that you found datas, do if else if needed
href_link_list = [] #use urljoin probably needed at some point
for href_tag in href_tag_list:
    href_link_list.append(href_tag['href'])
    print("href:", href_tag['href'])
I used this reference web page, which was provided by another Stack Overflow user:
Web Link
NB: You will have to take the "/article/" prefix off the list entries.
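For the urljoin step the comment in the code above hints at, something like this should work (a sketch, assuming Python 3's urllib.parse):
from urllib.parse import urljoin

# Resolve the relative "/article/..." hrefs against the page URL
full_links = [urljoin(quote_page, href) for href in href_link_list]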
There can be various ways to get all the hrefs starting with "/article/". One of the simple ways to do this would be:
# import libraries
import urllib.request
from bs4 import BeautifulSoup
import os
import re
import ssl
# specify the url
quote_page = 'https://econpapers.repec.org/RAS/pab7.htm'
gcontext = ssl.SSLContext()  # unverified SSL context, to avoid certificate errors
# query the website and return the html to the variable page
page = urllib.request.urlopen(quote_page, context=gcontext)
# parse the html using beautiful soup and store in variable soup
soup = BeautifulSoup(page, 'html.parser')
#print(soup)
# Anchor tags starting with "/article/"
anchor_tags = soup.find_all('a', href=re.compile("/article/"))
for link in anchor_tags:
    print(link.get('href'))
This answer would be helpful as well. And go through the quick start guide of BeautifulSoup; it has very good, elaborate examples.