The HTML code is like this:
<div class="AAA">Text of AAA<a href="......AAA/url">Display text of URL A</a></div>
<div class="BBB">Text of BBB<a href="......BBB/url">Display text of URL B</a></div>
<div class="CCC">Text of CCC</div>
<div class="DDD">Text of DDD</div>
I want to parse the text of every div and check whether a URL exists inside it; if it does, I also want to extract it and show it in the output.
The expected output is:
Text of AAA
Display text of URL A
......AAA/url
Text of BBB
Display text of URL B
......BBB/url
Text of CCC
Text of DDD
I tried to nest a find_all('a') loop inside the find_all('div') loop, but it messed up my output:
from bs4 import BeautifulSoup

html = """
<div class="AAA">Text of AAA<a href="......AAA/url">Display text of URL A</a></div>
<div class="BBB">Text of BBB<a href="......BBB/url">Display text of URL B</a></div>
<div class="CCC">Text of CCC</div>
<div class="DDD">Text of DDD</div>
"""

soup = BeautifulSoup(html, "lxml")
for div in soup.findAll('div'):
    print(div.text)
    try:
        print(div.find('a').text)
        print(div.find('a')["href"])
    except AttributeError:
        pass
Output
Text of AAADisplay text of URL A
Display text of URL A
......AAA/url
Text of BBBDisplay text of URL B
Display text of URL B
......BBB/url
Text of CCC
Text of DDD
Don't know what your code looks like, but the basic idea is something like this:
data = soup.findAll('div')
for div in data:
    links = div.findAll('a')
    for a in links:
        print(a['href'])
        print(a.text)
This will give you the URL and the text of each link.
You can loop over the divs and then print the elements of each div's .contents:
s = """
<div class="AAA">Text of AAADisplay text of URL A .
</div>
<div class="BBB">Text of BBBDisplay text of URL B .
</div>
<div class="CCC">Text of CCC</div>
<div class="DDD">Text of DDD</div>
"""
from bs4 import BeautifulSoup as soup
for _text, *_next in map(lambda x:x.contents, soup(s, 'html.parser').find_all('div')):
print(_text)
if _next:
print(_next[0].text)
print(_next[0]['href'])
Output:
Text of AAA
Display text of URL A
......AAA/url
Text of BBB
Display text of URL B
......BBB/url
Text of CCC
Text of DDD
It's just easier to read. You can also use this to get the expected output:
divs = soup.find_all('div')
for div in divs:
    print(div.contents[0])   # Text of AAA
    link = div.find('a')
    if link:
        print(link.text)     # Display text of URL A
        print(link['href'])  # ......AAA/url
Thanks all, I worked out the solution:
# ans_kin holds the <div> tags from the page; answer_kin collects the results
for h in ans_kin:
    link = h.find('a')
    if link:
        links = h.text + link.get('href')
    else:
        links = h.text
    answer_kin.append(links)
Related
So I'm trying to scrape a news website and get the actual text inside it. My problem right now is that the actual article is divided into several <p> tags, which in turn are inside a <div> tag.
It looks like this:
<div>
<p><strong>S</strong>paragraph</p>
<p>paragraph</p>
<h2 class="more-on__heading">heading</h2>
<figure>fig</figure>
<h2>header</h2>
<p>text</p>
<p>text</p>
<p>text</p>
</div>
What I tried so far is this:
import requests
from bs4 import BeautifulSoup

article = requests.get(url)
soup = BeautifulSoup(article.content, 'html.parser')

article_title = soup.find('h1').text
article_author = soup.find('a', class_='author-link').text
article_text = ''

for element in soup.find('div', class_='wysiwyg wysiwyg--all-content css-1vkfgk0'):
    article_text += element.find('p').text
But it raises AttributeError: 'NoneType' object has no attribute 'text'.
Because the expected output is not that clear from the question: a general approach would be to select all <p> in your <div>, e.g. with CSS selectors, extract the text and join() it by whatever you like:
article_text = '\n'.join(e.text for e in soup.select('div p'))
If you just want to scrape the text from the <p> siblings that follow the h2 in your example, use:
article_text = '\n'.join(e.text for e in soup.select('h2 ~ p'))
or with find() and find_next_siblings():
article_text = '\n'.join(e.text for e in soup.find('h2').find_next_siblings('p'))
Example
from bs4 import BeautifulSoup
html='''
<div>
<p><strong>S</strong>paragraph</p>
<p>paragraph</p>
<h2 class="more-on__heading">heading</h2>
<figure>fig</figure>
<h2>header</h2>
<p>text</p>
<p>text</p>
<p>text</p>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
article_text = '\n'.join(e.text for e in soup.select('h2 ~ p'))
print(article_text)
Output
text
text
text
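As a side note on the error in the question: looping directly over the div yields each child node in turn, and calling .find('p') on a <p> child searches for a nested <p> and returns None, hence the 'NoneType' message. A minimal sketch of a guarded version, keeping the site-specific class name from the question:
article_text = ''
for p in soup.find('div', class_='wysiwyg wysiwyg--all-content css-1vkfgk0').find_all('p'):
    article_text += p.text + '\n'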
I have an html that looks like this:
<h3>
Heading 3
</h3>
<ol>
<li>
<ol>
....
</ol>
</li>
</ol>
I need to highlight the entire HTML starting from the first <ol>. I have found this solution:
import bs4

soup = bs4.BeautifulSoup(open('temp.html').read(), 'lxml')
new_h1 = soup.new_tag('h1')
new_h1.string = 'Hello '
mark = soup.new_tag('mark')
mark.string = 'World'
new_h1.append(mark)
h1 = soup.h1
h1.replace_with(new_h1)
print(soup.prettify())
Is there any way to highlight entire html without having to find out the specific text?
Edit:
This is what I mean by highlighted text
Edit:
I have tried this code but it only highlights the very innermost li:
for node in soup2.findAll('li'):
    if not node.string:
        continue
    value = node.string
    mark = soup2.new_tag('mark')
    mark.string = value
    node.replace_with(mark)
This will highlight all the <li> content. As I have no clear idea of what your full HTML looks like, I have highlighted every <li> string; you can modify this code to suit your requirements.
from bs4 import BeautifulSoup

with open('index.html') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')

tag = soup.findAll('li')

# Highlight the <li> content
for li in tag:
    newtag = soup.new_tag('mark')
    li.string.wrap(newtag)

print(soup)
After Highlighting: https://i.stack.imgur.com/iIbXk.jpg
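If the goal is really to highlight everything starting from the first <ol> rather than each <li> string, one option (a sketch, assuming a <mark> wrapped around the whole list renders the way you want) is to wrap the <ol> tag itself:
import bs4

soup = bs4.BeautifulSoup(open('temp.html').read(), 'lxml')
first_ol = soup.find('ol')
if first_ol is not None:
    first_ol.wrap(soup.new_tag('mark'))  # the entire list, nested <ol> included, ends up inside <mark>
print(soup.prettify())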
I have the HTML content below, wherein the div tag looks like this:
<div class="block">aaa
<p> bbb</p>
<p> ccc</p>
</div>
From the above I want to extract only the text "aaa" and not the content of the other tags.
When I do,
soup.find('div', {"class": "block"})
it gives me all the content as text, and I want to avoid the contents of the <p> tags.
Is there a method available in BeautifulSoup to do this?
Check the type of each element. You could try:
from bs4 import BeautifulSoup
from bs4 import element
s = '''
<div class="block">aaa
<p> bbb</p>
<p> ccc</p>
<h1>ddd</h1>
</div>
'''
soup = BeautifulSoup(s, "lxml")
for e in soup.find('div', {"class": "block"}):
    if type(e) == element.NavigableString and e.strip():
        print(e.strip())
# aaa
And this will ignore all text in sub tags.
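A shorter variant of the same idea (a sketch, not part of the answer above): ask the div for its direct text children only, so everything nested in sub-tags is skipped automatically:
div = soup.find('div', {"class": "block"})
direct = [t.strip() for t in div.find_all(text=True, recursive=False) if t.strip()]
print(direct)  # ['aaa']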
You can remove the p tags from that div, which effectively gives you the aaa text.
Here's how:
from bs4 import BeautifulSoup
sample = """<div class="block">aaa
<p> bbb</p>
<p> ccc</p>
</div>
"""
s = BeautifulSoup(sample, "html.parser")
excluded = [i.extract() for i in s.find("div", class_="block").find_all("p")]
print(s.text.strip())
Output:
aaa
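One caveat: .extract() permanently removes the <p> tags from the soup, so if you still need the full document afterwards, work on a copy (a small sketch using copy.copy(), which Beautiful Soup supports for tags and soups):
import copy

s_copy = copy.copy(s)  # independent copy of the parse tree
excluded = [i.extract() for i in s_copy.find("div", class_="block").find_all("p")]
print(s_copy.text.strip())       # aaa
print(s.find("p") is not None)   # True - the original soup is untouched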
You can use find_next(), which returns the first match found:
from bs4 import BeautifulSoup
html = '''
<div class="block">aaa
<p> bbb</p>
<p> ccc</p>
</div>
'''
soup = BeautifulSoup(html, "html.parser")
print(soup.find('div', {"class": "block"}).find_next(text=True))
Output:
aaa
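A hedged note on the output: find_next(text=True) returns the raw NavigableString, which here still ends with the newline that precedes the first <p>, so strip it if you need a clean value:
print(soup.find('div', {"class": "block"}).find_next(text=True).strip())  # aaa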
I am trying to find an attribute inside a div class which has multiple values, using BS4. The HTML is:
<div class="size ">
<a class="selectSize"
id="25746"
data-ean="884751585616"
ata-test="170"
data-test1=""
data-test2="1061364-41"
data-test3-original="41"
data-test4-eu="41"
data-test5-uk="7"
data-test6-us="8"
data-test-cm="26"
</div>
</div>
I want to find data-test5-uk. My current code is:
soup = bs(size.text, "html.parser")
sizes = soup.find_all("div", {"class": "size"})
size = sizes[0]["data-test5-uk"]
size.text comes from a GET request to the site with the HTML. However, it returns:
size = sizes[0]["data-test5-uk"]
File "C:\Users\ninja_000\AppData\Local\Programs\Python\Python36\lib\site-packages\bs4\element.py", line 1011, in __getitem__
return self.attrs[key]
KeyError: 'data-test5-uk'
Help is appreciated!
Explanation first, then the solution.
.find_all('tag') is used to find all instances of that tag, so we can loop through them later.
.find('tag') is used to find only the first instance.
We can extract an attribute either with ['arg'] or with .get('arg'); the difference is that ['arg'] raises a KeyError when the attribute is missing, while .get('arg') returns None.
from bs4 import BeautifulSoup
html = '''<div class="size ">
<a class="selectSize"
id="25746"
data-ean="884751585616"
data-test="170"
data-test1=""
data-test2="1061364-41"
data-test3-original="41"
data-test4-eu="41"
data-test5-uk="7"
data-test6-us="8"
data-test-cm="26">
</a>
</div>'''

soup = BeautifulSoup(html, 'lxml')
one_div = soup.find('div', class_='size ')
print(one_div.find('a')['data-test5-uk'])
# your code didn't work because you weren't looking inside the <a> tag;
# the attribute is reached with .find('a')['data-test5-uk']

# for multiple divs
for each in soup.find_all('div', class_='size '):
    # loop through each instance and do the same
    datauk = each.find('a')['data-test5-uk']
    print('data-test5-uk:', datauk)
Output:
data-test5-uk: 7
Additional
Why did your ['arg'] fail? You tried to extract ["data-test5-uk"] from the div itself, but <div class="size "> has no such attribute; its only attribute is class="size ". The attribute lives on the <a> tag inside it.
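A small sketch of that difference, reusing the soup from the example above: indexing a missing attribute raises KeyError (exactly the traceback in the question), while .get() returns None:
div = soup.find('div', class_='size ')
print(div.get('data-test5-uk'))        # None - the attribute is not on the div itself
print(div.find('a')['data-test5-uk'])  # 7 - it lives on the nested <a>
# div['data-test5-uk'] would raise KeyError: 'data-test5-uk'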
Supposing I have an html string like this:
<html>
<div id="d1">
Text 1
</div>
<div id="d2">
Text 2
a url
Text 2 continue
</div>
<div id="d3">
Text 3
</div>
</html>
I want to extract the content of d2 that is NOT wrapped by other tags, skipping the <a> url. In other words, I want to get this result:
Text 2
Text 2 continue
Is there a way to do it with BeautifulSoup?
I tried this, but it is not correct:
soup = BeautifulSoup(html_doc, 'html.parser')
s = soup.find(id='d2').text
print(s)
Try with .find_all(text=True, recursive=False):
from bs4 import BeautifulSoup
div_test="""
<html>
<div id="d1">
Text 1
</div>
<div id="d2">
Text 2
a url
Text 2 continue
</div>
<div id="d3">
Text 3
</div>
</html>
"""
soup = BeautifulSoup(div_test, 'lxml')
s = soup.find(id='d2').find_all(text=True, recursive=False)
print(s)
print([e.strip() for e in s]) #remove space
it will return a list with only text:
[u'\n Text 2\n ', u'\n Text 2 continue\n ']
[u'Text 2', u'Text 2 continue']
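To turn that list into the exact two-line result asked for, a small follow-up on the same s variable:
print('\n'.join(e.strip() for e in s if e.strip()))
# Text 2
# Text 2 continue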
You can get only the NavigableString objects with a simple list comprehension.
import bs4

tag = soup.find(id='d2')
s = ''.join(e for e in tag if type(e) is bs4.element.NavigableString)
Alternatively, you can use the decompose() method to delete all the child tags, then get the remaining text with .text:
tag = soup.find(id='d2')
for e in tag.find_all():
    e.decompose()
s = tag.text
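A usage note on this variant (a sketch against the sample document above): after decompose() the remaining text still contains the original newlines and indentation, so clean it up before printing:
print('\n'.join(line.strip() for line in s.splitlines() if line.strip()))
# Text 2
# Text 2 continue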