How to exclude an element with BeautifulSoup (Python) - python

I am trying to extract the article text from this article (https://www.vanityfair.com/style/society/2014/06/monica-lewinsky-humiliation-culture) and exclude the legal container at the bottom. Extracting the text seems easy, but I can't seem to get rid of the container. I have stored it in the legal variable for easier use.
Here is my code so far:
import requests
from bs4 import BeautifulSoup
base_url = 'https://www.vanityfair.com/style/society/2014/06/monica-lewinsky-humiliation-culture'
r = requests.get(base_url)
r_html = r.text
soup = BeautifulSoup(r_html)
legal = soup.find('div',{'class': 'legal-container'})
paragraphs = soup.find_all('p')
for text in paragraphs:
    print(text.get_text())
How should I go about this?

It is usually better to find the portion you want and extract that part alone than to get all the text and then eliminate the unwanted parts.
In your case, the text you probably want is grouped in section tags within a div that has a class attribute of content drop-cap. You can get this using:
content_div = soup.find('div', {'class': 'content drop-cap'})
This way, you get the flexibility of grouping the text by sections:
sections = content_div.findAll('section')
However, if you still prefer to get all the paragraphs and exclude the legal container specifically, you can remove the legal container from the soup object.
From the Beautiful Soup documentation:
decompose()
"Tag.decompose() removes a tag from the tree, then completely destroys it and its contents."
If you choose to do this, then remove the tag(s) you don't want before extracting the text:
soup.find('div', {'class': 'legal-container'}).decompose()
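Both approaches can be compared on a small inline document. This is only a minimal sketch: the HTML below is invented for illustration and merely reuses the class names mentioned above, so the real page's markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that only mimics the class names discussed above.
html = """
<div class="content drop-cap">
  <section><p>First paragraph.</p><p>Second paragraph.</p></section>
</div>
<div class="legal-container"><p>Legal notice.</p></div>
"""

# Approach 1: select only the container that holds the article text.
soup = BeautifulSoup(html, "html.parser")
content_div = soup.find("div", class_="content drop-cap")
wanted = [p.get_text() for p in content_div.find_all("p")]

# Approach 2: remove the unwanted container, then collect every paragraph.
soup = BeautifulSoup(html, "html.parser")
legal = soup.find("div", class_="legal-container")
if legal is not None:  # find() returns None when nothing matches
    legal.decompose()
remaining = [p.get_text() for p in soup.find_all("p")]

print(wanted)
print(remaining)
```

The None check matters on real pages: calling decompose() directly on the result of find() raises an AttributeError if the container is missing.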

Related

Scrape any string with Python + Beautiful Soup that contains 5 numbers

I live in Germany, where ZIP codes are in most cases a 5-digit number, e.g. 53525. I would like to extract that information from a website using Beautiful Soup.
I am new to Python and Beautiful Soup, and I am not sure how to translate "find every run of 5 digits followed by a space" into Python.
import requests
import urllib.request,re
from bs4 import BeautifulSoup
source = requests.get('DOMAIN').text
soup = BeautifulSoup(source, 'lxml')
soup.find_all(NOTSUREHERE)
In the simplest scenario:
NOTSUREHERE should be replaced by name='tag_name', where tag_name is a tag in which you are certain to find ZIP codes (and no other numeric field that could be mistaken for a ZIP code).
Then pass each element of the result to re.findall(regex, string), where regex = '([0-9]{5})' (as I understand the pattern) and string is the element from which you are extracting ZIP codes.
import requests
import urllib.request,re
from bs4 import BeautifulSoup
source = requests.get('DOMAIN').text
soup = BeautifulSoup(source, 'lxml')
tag_list = soup.find_all(name = 'tag_name')
match_list = []
for tag in tag_list:
    match_list.append(re.findall('([0-9]{5})', str(tag)))
Watch out for matches that aren't ZIP codes. You may need to refine the soup.find_all() call by adding more arguments. The documentation may offer more options, but the attrs argument can be set to {'target_attribute': 'target_att_value'}, an attribute and a value that reliably mark a tag containing a ZIP code.
EDIT: Regarding possible empty elements, the question "Removing empty elements from an array in Python" has a very straightforward solution.
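The steps above could be sketched as follows. The markup, the p tag and the address class are assumptions made up for illustration. The pattern is also a small variation on the one above: word boundaries (\b) are added so that five digits inside a longer number, such as a phone number, are not reported.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup; the tag name and class are assumptions for illustration.
html = """
<p class="address">Hauptstrasse 1, 53525 Musterstadt</p>
<p class="address">Phone: 0123456789</p>
<p class="address">Am Markt 5, 10115 Berlin</p>
"""

soup = BeautifulSoup(html, "html.parser")

# \b keeps the pattern from matching five digits inside a longer number.
zip_pattern = re.compile(r"\b\d{5}\b")

zip_codes = []
for tag in soup.find_all("p", attrs={"class": "address"}):
    # Search the visible text only, and use extend() so the result is a
    # flat list with no empty sub-lists for tags without a match.
    zip_codes.extend(zip_pattern.findall(tag.get_text()))

print(zip_codes)
```

Searching tag.get_text() rather than str(tag) avoids accidental matches in attribute values such as ids or URLs.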

How to get a specific word from html page using beautiful soup in python

I have to extract specific words from an HTML page and count the number of times each word is repeated. How do I do this using Beautiful Soup in Python? How do I pass the URL to the soup and then count the words?
This is my code till now. I have no idea what to do next.
import bs4 as bs
import urllib.request
source = urllib.request.urlopen('https://pythonprogramming.net/parsememcparseface/').read()
soup = bs.BeautifulSoup(source,'lxml')
for paragraph in soup.find_all('p'):
    print(paragraph.string)
    print(str(paragraph.text))
You could get all the text in the page using
soup.get_text()
After assigning that to a variable, you can use the .count() method to find the number of times a certain string appears in the page text, e.g.
text = soup.get_text()
print(text.count('word'))
Note that .count() also counts matches inside longer words: 'house' is counted in 'houses'. To avoid that, split the text on whitespace and compare each item of the resulting list against the word.
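The difference between the two counts can be seen in a short sketch. The HTML is invented for illustration, and a regular expression is used here instead of a plain split so that punctuation does not stay attached to the words. That is a swapped-in variation, not the only way to do it.

```python
import re
from collections import Counter
from bs4 import BeautifulSoup

# Hypothetical page content, made up for illustration.
html = "<p>The house is near other houses.</p><p>A house, a home.</p>"
soup = BeautifulSoup(html, "html.parser")

# A separator keeps the text of adjacent tags from running together.
text = soup.get_text(separator=" ")

# Substring count: also counts the "house" inside "houses".
substring_count = text.count("house")

# Whole-word count: tokenize first, then count exact matches.
words = re.findall(r"[a-z']+", text.lower())
word_counts = Counter(words)

print(substring_count)
print(word_counts["house"])
```

Counter also gives the frequency of every other word in one pass, for example word_counts.most_common(5).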

Filtering out one string from a print statement in python/BeautifulSoup

I am using BeautifulSoup to scrape many pages of a website for comments. Each page of this website has the comment "[[commentMessage]]". I want to filter out this string so it does not print every time the code runs. I'm very new to Python and BeautifulSoup, and I couldn't seem to find an answer after looking for a bit, though I may be searching for the wrong thing. Any suggestions? My code is below:
from bs4 import BeautifulSoup
import urllib
r = urllib.urlopen('website url').read()
soup = BeautifulSoup(r, "html.parser")
comments = soup.find_all("div", class_="commentMessage")
for element in comments:
    print(element.find("span").get_text())
All of the comments are in spans within divs of the class commentMessage, including the unnecessary comment "[[commentMessage]]".
A simple if check should do:
for element in comments:
    text = element.find("span").get_text()
    if "[[commentMessage]]" not in text:
        print(text)
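A self-contained version of that check might look like the sketch below. The HTML is invented for illustration, and the guard against a div without a span is an added assumption about what real pages could contain.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, made up for illustration.
html = """
<div class="commentMessage"><span>[[commentMessage]]</span></div>
<div class="commentMessage"><span>Great article!</span></div>
<div class="commentMessage"></div>
<div class="commentMessage"><span>Thanks for sharing.</span></div>
"""
soup = BeautifulSoup(html, "html.parser")

PLACEHOLDER = "[[commentMessage]]"

kept = []
for element in soup.find_all("div", class_="commentMessage"):
    span = element.find("span")
    if span is None:  # skip divs that contain no span at all
        continue
    text = span.get_text()
    if PLACEHOLDER not in text:  # drop the template placeholder
        kept.append(text)

print(kept)
```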

Python - Looping through HTML Tags and using IF

I am using Python to extract data from a webpage. The webpage has a recurring HTML div tag with class="result" which contains other data (such as location, organisation, etc.). I am able to loop through the HTML using Beautiful Soup, but when I add a condition, such as checking whether a certain word (e.g. 'NHS') exists in the segment, it doesn't return anything, even though I know certain segments contain it. This is the code:
soup = BeautifulSoup(content)
details = soup.findAll('div', {'class': 'result'})
for detail in details:
    if 'NHS' in detail:
        print(detail)
Hope my question makes sense...
findAll returns a list of tags, not strings. Perhaps convert them to strings?
s = "<p>golly</p><p>NHS</p><p>foo</p>"
soup = BeautifulSoup(s)
details = soup.findAll('p')
type(details[0]) # prints: <class 'BeautifulSoup.Tag'>
You are looking for a string among tags. It is better to look for a string among strings:
for detail in details:
    if 'NHS' in str(detail):
        print(detail)
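One caveat with str(detail) is that it searches the whole markup, attribute values included. A variation, sketched below on invented HTML, is to test the tag's visible text with get_text() instead. Which of the two is appropriate depends on the page, so treat this as an illustration and not as the definitive fix.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the data-source attribute is made up to show the
# difference between searching the markup and searching the text.
html = """
<div class="result"><span>Location: Leeds</span><span>NHS Trust</span></div>
<div class="result" data-source="NHS"><span>Private clinic</span></div>
<div class="result"><span>Other organisation</span></div>
"""
soup = BeautifulSoup(html, "html.parser")
details = soup.find_all("div", {"class": "result"})

# Searching the serialized markup also matches attribute values.
by_markup = [d for d in details if "NHS" in str(d)]

# Searching the visible text matches only what a reader would see.
by_text = [d for d in details if "NHS" in d.get_text()]

print(len(by_markup), len(by_text))
```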

Python BeautifulSoup give multiple tags to findAll

I'm looking for a way to use findAll to get two tags, in the order they appear on the page.
Currently I have:
import requests
from bs4 import BeautifulSoup

def get_soup(url):
    request = requests.get(url)
    page = request.text
    soup = BeautifulSoup(page)
    get_tags = soup.findAll('hr' and 'strong')
    for each in get_tags:
        print(each)
If I use that on a page with only 'em' or 'strong' in it, it gets me all of those tags; if I use it on a page with both, it only gets the 'strong' tags.
Is there a way to do this? My main concern is preserving the order in which the tags are found.
You could pass a list, to find any of the given tags:
tags = soup.find_all(['hr', 'strong'])
Use regular expressions:
import re
get_tags = soup.findAll(re.compile(r'(hr|strong)'))
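The expression r'(hr|strong)' will find either hr tags or strong tags. Beautiful Soup applies the pattern with search(), so it matches any tag whose name contains hr or strong; anchor it as r'^(hr|strong)$' if you need exact tag names only.
To find multiple tags, you can also use the comma (,) CSS selector, which lets you specify multiple tags separated by commas.
To use a CSS selector, use the .select_one() method instead of .find(), or .select() instead of .find_all().
For example, to select all <hr> and <strong> tags, separate the tags with a comma: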
tags = soup.select('hr, strong')
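The three suggestions can be checked side by side on a small document. As background, the expression 'hr' and 'strong' in the question evaluates to just 'strong' in Python, which is why only strong tags were returned. The HTML below is invented for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup with the two tags interleaved.
html = "<p><strong>one</strong></p><hr><p><strong>two</strong></p><hr>"
soup = BeautifulSoup(html, "html.parser")

# 1. A list of tag names.
by_list = [t.name for t in soup.find_all(["hr", "strong"])]

# 2. A regular expression, anchored so only exact names match.
by_regex = [t.name for t in soup.find_all(re.compile(r"^(hr|strong)$"))]

# 3. A CSS selector group.
by_css = [t.name for t in soup.select("hr, strong")]

print(by_list)
print(by_regex)
print(by_css)
```

All three return the tags in the order in which they appear in the document, which was the main concern in the question.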
