Preserve space when stripping HTML with Beautiful Soup - python

from BeautifulSoup import BeautifulSoup
html = "<html><p>Para 1. Words</p><p>Merge. Para 2<blockquote>Quote 1<blockquote>Quote 2</p></html>"
print html
soup = BeautifulSoup(html)
print u''.join(soup.findAll(text=True))
The output of this code is "Para 1. WordsMerge. Para 2Quote 1Quote 2".
I don't want the last word of paragraph one to merge with the first word of paragraph two.
E.g. "Para 1. Words Merge. Para 2 Quote 1 Quote 2".
Can this be achieved using the BeautifulSoup library?

If you are using Beautiful Soup 4.x, pass the separator as the first argument to get_text():
from bs4 import BeautifulSoup
...
...
soup.get_text(" ")
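A minimal Python 3 sketch of the separator argument, assuming Beautiful Soup 4 and the built-in html.parser. The markup is a simplified, well-formed variant of the question's input, and the variable names are illustrative:

```python
from bs4 import BeautifulSoup

# Simplified, well-formed variant of the markup from the question (assumption).
html = (
    "<html><body><p>Para 1. Words</p><p>Merge. Para 2</p>"
    "<blockquote>Quote 1</blockquote></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# Without a separator, adjacent strings run together.
merged = soup.get_text()

# With a separator, one space is placed between strings; strip=True also
# trims each string and drops whitespace-only ones.
spaced = soup.get_text(" ", strip=True)

print(merged)
print(spaced)
```

Passing strip=True is optional here, but it avoids doubled spaces when the real page has newlines or indentation between tags.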

Just join the pieces with a space:
print u' '.join(soup.findAll(text=True))
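In Python 3 with Beautiful Soup 4, the same idea might look like the sketch below. The two-paragraph input is an assumption chosen to show one side effect: whitespace between tags is itself a string, so a plain join can leave stray whitespace that a split/join pass removes.

```python
from bs4 import BeautifulSoup

# Illustrative input with a newline between the two paragraphs (assumption).
html = "<p>Para 1. Words</p>\n<p>Merge. Para 2</p>"
soup = BeautifulSoup(html, "html.parser")

# Every text node in document order, including the "\n" between the tags.
pieces = soup.find_all(string=True)

# Joining with a space keeps the words apart but preserves the stray newline.
joined = " ".join(pieces)

# Splitting on any whitespace and re-joining collapses it to single spaces.
normalized = " ".join(joined.split())

print(normalized)
```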

Related

Python3 - Extract the text from a bs4.element.Tag and add to a dictionary

I am scraping a website which returns a bs4.element.Tag similar to the following:
<span class="attributes-value">
<span class="four-door">four door</span>
<span class="inline-4-engine">inline 4 engine</span>
<span class="24-gallons-per-mile">24 gallons per mile</span>
</span>
I am trying to extract just the text from this block and add it to a dictionary. All of the examples that I am seeing on the forum include some sort of common element like an 'id' or similar. I am not an HTML guy, so I may be using incorrect terms.
What I would like to do is get the text ("four door", "inline 4 engine", etc.) and add it as values to a dictionary, with the key being a pre-designated variable of car_model.
cars = {'528i':['four door', 'inline 4 engine']}
I can't figure out a universal way to pull out the text because there may be more or fewer span classes with different text. Thanks for your help!
You need to loop through all the elements matched by a selector and extract the text value from each of them.
A selector is a specific path to the element you want. Here the selector is .attributes-value span, where .attributes-value matches the element with that class and span matches the span tags nested inside it.
The get_text() method retrieves the content between the opening and closing tags. This is exactly what you need.
I also recommend using the lxml parser because it is usually faster than the built-in one.
The full code is below:
from bs4 import BeautifulSoup
import lxml
html = '''
<span class="attributes-value">
<span class="four-door">four door</span>
<span class="inline-4-engine">inline 4 engine</span>
<span class="24-gallons-per-mile">24 gallons per mile</span>
</span>
'''
soup = BeautifulSoup(html, 'lxml')
cars = {
    '528i': []
}
for span in soup.select(".attributes-value span"):
    cars['528i'].append(span.get_text())
print(cars)
Output:
{'528i': ['four door', 'inline 4 engine', '24 gallons per mile']}
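You can use:
from collections import defaultdict
from bs4 import BeautifulSoup
out = defaultdict(list)
# html_doc holds the HTML from the question
soup = BeautifulSoup(html_doc, 'html.parser')
for tag in soup.select(".attributes-value span"):
    out["528i"].append(tag.text)
print(dict(out))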
Prints:
{'528i': ['four door', 'inline 4 engine', '24 gallons per mile']}
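Since the question asks for a universal way to handle a varying number of spans, the two answers above can be wrapped in a small helper. This is only a sketch: the function name extract_attributes, the pages mapping, and the second model "X5" with its markup are hypothetical and not part of the original question.

```python
from bs4 import BeautifulSoup

# Hypothetical input: one HTML fragment per car model (the "X5" entry is invented).
pages = {
    "528i": """
        <span class="attributes-value">
          <span class="four-door">four door</span>
          <span class="inline-4-engine">inline 4 engine</span>
          <span class="24-gallons-per-mile">24 gallons per mile</span>
        </span>
    """,
    "X5": """
        <span class="attributes-value">
          <span class="five-door">five door</span>
          <span class="v6-engine">v6 engine</span>
        </span>
    """,
}


def extract_attributes(html):
    """Return the text of every span nested inside .attributes-value."""
    soup = BeautifulSoup(html, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.select(".attributes-value span")]


# The selector does not depend on the inner class names, so any number of
# spans with any text is handled the same way.
cars = {model: extract_attributes(html) for model, html in pages.items()}
print(cars)
```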

Beautiful Soup filtering for more than one keyword

Code:
soup = BeautifulSoup(urllib.request.urlopen(link['href']).read(), 'lxml')
# Find CompanyA links
for link in soup.findAll('a', href=True, text='CompanyA'):
    print(link['href'])
Is it possible to filter for more than one, like this?
text='CompanyA' OR text='CompanyB' OR text='CompanyC'
This returns all the a elements that have an href attribute and whose text is one of the strings in your list.
soup.findAll('a', href=True, text=lambda value: value and value in ["CompanyA", "CompanyB", "CompanyC"])
Use a regular expression.
import re
for link in soup.findAll("a", href=True, text=re.compile("CompanyA|CompanyB|CompanyC")):
    print(link['href'])
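The two answers behave differently on near matches, which the sketch below tries to show. It assumes Beautiful Soup 4.4 or later (for the string argument) and uses invented markup. A list passed as string matches any of its items exactly, while a compiled pattern is applied as a search, so an unanchored pattern also accepts longer link texts.

```python
import re
from bs4 import BeautifulSoup

# Invented markup for illustration.
html = """
<a href="/a">CompanyA</a>
<a href="/b">CompanyB</a>
<a href="/ab">CompanyAB Holdings</a>
<a href="/d">CompanyD</a>
<a>CompanyC</a>
"""
soup = BeautifulSoup(html, "html.parser")
wanted = ["CompanyA", "CompanyB", "CompanyC"]

# A list means "the link text equals any one of these strings".
exact = [a["href"] for a in soup.find_all("a", href=True, string=wanted)]

# An unanchored pattern also matches "CompanyAB Holdings".
alternatives = "|".join(re.escape(name) for name in wanted)
loose = [a["href"] for a in soup.find_all("a", href=True, string=re.compile(alternatives))]

# Anchoring the pattern restores whole-string matching.
strict_pattern = re.compile(f"^(?:{alternatives})$")
strict = [a["href"] for a in soup.find_all("a", href=True, string=strict_pattern)]

print(exact, loose, strict)
```

The CompanyC link is skipped in all three results because it has no href attribute.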

How to find this generic tag inside a HTML code with BS4 (beautiful soup)

I am trying to find this "generic" tag (there is only a span tag). I've tried a lot of things, but none of them worked. The code below returns more than I want; I'm trying to reach only the "573 m²".
Code:
Meters = [headline3.get_text() for headline3 in soup.find_all("ul", {"class": "feature__container"})]
Output:
['\n 573 m²\n \n 4 \n \n 4 \n \n 4 \n ',
HTML code: [image not included]
First, you can find all li elements. Then, for each li element, get the first direct span child element and access its text.
Example:
meters = [li.find("span", recursive=False).get_text() for li in soup.find_all("li", { "class" : "feature__item" }) ]
Since there is no further way to exclude the other values using HTML selectors (all are span tags), you might have to keep only the values containing m² manually to get your final output.
Like this:
result = list(map(int, [i.replace('m²', '').strip() for i in meters if 'm²' in i]))
Outputs:
[351, 573 ...]
Reference:
How to find children of nodes using BeautifulSoup
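Putting the two steps of this answer together, a self-contained sketch could look like the following. The markup is hypothetical, since the original page was only shown as an image; only the class names feature__container and feature__item come from the thread.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page structure is not available.
html = """
<ul class="feature__container">
  <li class="feature__item"><span>573 m²</span></li>
  <li class="feature__item"><span>4</span></li>
  <li class="feature__item"><span>4</span></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Step 1: text of the first direct span child of every feature item.
values = [
    li.find("span", recursive=False).get_text(strip=True)
    for li in soup.find_all("li", class_="feature__item")
]

# Step 2: keep only the area values and convert them to integers.
areas = [int(v.replace("m²", "").strip()) for v in values if "m²" in v]

print(values)
print(areas)
```

If an li has no direct span child, find() returns None and the comprehension raises AttributeError, so real pages may need a guard for that case.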

Finding an exact match of the text using BeautifulSoup

I would like to extract the exact matching value of a text from HTML using BeautifulSoup, but I am also getting text that only partially matches my exact text.
My code is:
from bs4 import BeautifulSoup
import re
import urllib2
url="http://www.somesite.com"
page=urllib2.urlopen(url)
soup=BeautifulSoup(page,"lxml")
for elem in soup(text=re.compile("exact text")):
    print elem
For the code above, the output is:
1. exact text
2. almost exact text
How can I get only the exact match using BeautifulSoup?
Note: the variable (elem) should be of type <class 'bs4.element.Comment'>.
You can search the soup for the desired element using its tag and any attribute value.
For example, this code searches for all a elements with id equal to some_id_value.
It then loops over each element found, testing whether its .text value is equal to "exact text".
If so, it prints the whole element.
for elem in soup.find_all('a', {'id':'some_id_value'}):
    if elem.text == "exact text":
        print(elem)
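Because the question notes that the matched node should be a bs4.element.Comment, the same equality test can also be applied to comment nodes directly. The sketch below is one possible way, using invented markup and a hypothetical helper named is_exact_comment:

```python
from bs4 import BeautifulSoup, Comment

# Invented markup: two comments and one ordinary text node.
html = "<div><!--exact text--><!-- almost exact text --><p>exact text</p></div>"
soup = BeautifulSoup(html, "html.parser")


def is_exact_comment(node):
    """True only for comment nodes whose stripped content equals the target."""
    return isinstance(node, Comment) and node.strip() == "exact text"


# A callable passed as string= is called once per text node in the document.
matches = soup.find_all(string=is_exact_comment)

for elem in matches:
    print(type(elem), repr(elem))
```

The paragraph containing "exact text" is not returned because it is an ordinary text node rather than a comment.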
Use BeautifulSoup's find_all method with its string argument for this.
As an example, here I parse a small page from Wikipedia about a place in Jamaica. I look for all strings whose text is 'Jamaica stubs', but I expect to find just one. When I find it, I display the text and its parent.
>>> url = 'https://en.wikipedia.org/wiki/Cassava_Piece'
>>> from bs4 import BeautifulSoup
>>> import requests
>>> page = requests.get(url).text
>>> soup = BeautifulSoup(page, 'lxml')
>>> for item in soup.find_all(string="Jamaica stubs"):
...     item
...     item.findParent()
...
'Jamaica stubs'
Jamaica stubs
On second thoughts, after reading the comment, a better way would be:
>>> url = 'https://en.wikipedia.org/wiki/Hockey'
>>> from bs4 import BeautifulSoup
>>> import requests
>>> import re
>>> page = requests.get(url).text
>>> soup = BeautifulSoup(page, 'lxml')
>>> for i, item in enumerate(soup.find_all(string=re.compile('women', re.IGNORECASE))):
...     i, item.findParent().text[:100]
...
(0, "Women's Bandy World Championships")
(1, "The governing body is the 126-member International Hockey Federation (FIH). Men's field hockey has b")
(2, 'The governing body of international play is the 77-member International Ice Hockey Federation (IIHF)')
(3, "women's")
I use IGNORECASE in the regex so that both 'Women' and 'women' are found in the Wikipedia article. I use enumerate in the for loop to number the displayed items, which makes them easier to read.
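The reason a compiled pattern returns "almost exact text" as well is that a pattern is applied as a search anywhere in the string. The standard-library sketch below, with made-up candidate strings, contrasts search with fullmatch and shows the effect of IGNORECASE:

```python
import re

# Made-up candidate strings for illustration.
candidates = ["exact text", "almost exact text", "Exact Text"]

pattern = re.compile("exact text")

# search() succeeds if the pattern occurs anywhere in the string.
loose = [s for s in candidates if pattern.search(s)]

# fullmatch() succeeds only if the whole string matches the pattern.
exact = [s for s in candidates if pattern.fullmatch(s)]

# IGNORECASE makes the whole-string comparison case-insensitive.
exact_ci = [s for s in candidates if re.fullmatch("exact text", s, re.IGNORECASE)]

print(loose)
print(exact)
print(exact_ci)
```

To get whole-string behaviour from a pattern passed to find_all, anchor it, for example re.compile(r"^exact text$").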

Extracting text from html tags with beautiful soup

I have an HTML page to scrape data from.
I need to get the item title, like 'Caliper Ring' here.
I'm getting the data from the tag where that title appears:
item_title = base_page.find_all('h1', class_='itemTitle')
It contains this tag structure:
[<h1 class="itemTitle"> <div class="l1">Caliper</div>
Ball
</h1>]
To extract 'Caliper Ball' I'm using
collector = []
for _ in item_title:
    collector.append(_.text)
so I'm getting this ugly output in the collector list:
[u"\nCaliper\r\n Ball\r\n "]
How can I clean the output so it reads "Caliper Ball"?
Don't use regex. You're adding too much overhead for something simple. BeautifulSoup4 already has something for this called stripped_strings. See my code below.
from bs4 import BeautifulSoup as bsoup
html = """[<h1 class="itemTitle"> <div class="l1">Caliper</div>
Ball
</h1>]"""
soup = bsoup(html, "html.parser")  # name the parser explicitly
item = soup.find("h1", class_="itemTitle")
base = list(item.stripped_strings)
print(" ".join(base))
Result:
Caliper Ball
Explanation: stripped_strings yields every string inside the specified tag with leading and trailing whitespace (spaces, line breaks and so on) removed, and skips strings that are only whitespace. It is a generator, which we can pass to list to get a list instead. Once it's a list, it's just a matter of using " ".join.
Let us know if this helps.
PS: Just to correct myself -- there's actually no need to use list on the result of stripped_strings, but it's better to show the above as such so it's explicit.
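As the PS says, the generator can go straight into join. A Python 3 sketch follows, assuming Beautiful Soup 4 with the built-in html.parser; the exact whitespace in the sample string is a guess based on the raw output shown in the question. It also shows that get_text with a separator and strip=True gives the same result in one call.

```python
from bs4 import BeautifulSoup

# Sample markup; the whitespace is reconstructed from the question's raw output.
html = '<h1 class="itemTitle"> <div class="l1">Caliper</div>\r\n      Ball\r\n   </h1>'
soup = BeautifulSoup(html, "html.parser")
item = soup.find("h1", class_="itemTitle")

# stripped_strings is a generator, so it can be joined without list().
title = " ".join(item.stripped_strings)

# Equivalent one-liner: separator plus strip=True.
same_title = item.get_text(" ", strip=True)

print(title)
```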
This regex will get you the output (Caliper Ball):
import re
# named text rather than str to avoid shadowing the built-in
text = """[<h1 class="itemTitle"> <div class="l1">Caliper</div>
Ball
</h1>]"""
regex = r'.*>([^<]*)<\/div>\s*\n\s*(\w*).*'
match = re.findall(regex, text)
new_data = (' '.join(w) for w in match)
print(''.join(new_data))  # => Caliper Ball
You can use the replace() method to replace \n and \r with nothing or with a space, and then use the strip() method to remove the leading and trailing spaces.
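A plain-string sketch of this last suggestion, using the raw value from the question. Python strings have no trim() method; strip() is the equivalent. Replacing the line breaks with spaces leaves a run of spaces in the middle, so splitting on whitespace and re-joining is shown as an alternative.

```python
# Raw text as returned by .text in the question.
raw = "\nCaliper\r\n Ball\r\n "

# replace() + strip(): the ends are clean, but the inner gap is several spaces wide.
replaced = raw.replace("\r", " ").replace("\n", " ").strip()

# split() with no arguments splits on any whitespace run, so re-joining
# collapses every gap to a single space.
collapsed = " ".join(raw.split())

print(repr(replaced))
print(repr(collapsed))
```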
