Iterate through the resultset bs4

I used bs4 to extract this result set:
<div>
<div>
</div>
Content 1
</div>
<div>
Content 2
</div>
I am trying to extract these two pieces of text:
Moi not cute not hot, the ugly bui bui type 1 and Actually, moi also dun know
from bs4 import BeautifulSoup
from urllib.request import urlopen  # Python 3 location of urlopen
import re
r = urlopen(
    'http://forums.hardwarezone.com.sg/eat-drink-man-woman-16/%5Bofficial%5D-chit-chat-students-part-2-a-5526993-55.html').read()
soup = BeautifulSoup(r, "lxml")
# raw string so that \d reaches the regex engine unchanged
letters = soup.find_all("div", attrs={"id": re.compile(r"post_message_\d+")})
Here is my code. How do I iterate through the result set so that it extracts only the content just before each closing div?
letters.find_all('div') returns an empty set.

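This prints all the messages, one list of non-empty lines per post:
from bs4 import BeautifulSoup
from urllib.request import urlopen  # Python 3 location of urlopen
import re
r = urlopen(
    'http://forums.hardwarezone.com.sg/eat-drink-man-woman-16/%5Bofficial%5D-chit-chat-students-part-2-a-5526993-55.html').read()
soup = BeautifulSoup(r, "lxml")
# raw string so that \d reaches the regex engine unchanged
letters = soup.find_all("div", attrs={"id": re.compile(r"post_message_\d+")})
for a in letters:
    # split each post's text into lines and drop the empty ones
    print([b.strip() for b in a.text.strip().split('\n') if b.strip()])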
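
find_all returns a ResultSet, which behaves like a list of tags, so any further search has to be called on each tag rather than on the set itself. The sketch below is one way to keep only the text that sits directly inside each post div and skip whatever is in nested divs. The inline HTML and the post_message_1 / post_message_2 ids are made up for illustration, and own_text is a hypothetical helper, not part of bs4.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the sample in the question
html = """
<div id="post_message_1">
<div>
</div>
Content 1
</div>
<div id="post_message_2">
Content 2
</div>
"""

soup = BeautifulSoup(html, "html.parser")
posts = soup.find_all("div", id=re.compile(r"post_message_\d+"))


def own_text(tag):
    """Join the strings that are direct children of tag, ignoring nested tags."""
    direct_strings = tag.find_all(string=True, recursive=False)
    return " ".join(s.strip() for s in direct_strings if s.strip())


# Iterate over the ResultSet one tag at a time
texts = [own_text(post) for post in posts]
print(texts)
```

Passing recursive=False limits the search to direct children, which is what keeps the text of nested divs out of the result.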

Related

Get html text with Beautiful Soup

I'm trying to get the number from inside a div:
<div class="tv-symbol-price-quote__value js-symbol-last">122.7<span class="">8</span></div>
I need the number 122.7, but I can't get it. I have tried:
strings = soup.find("div", class_="tv-symbol-price-quote__value js-symbol-last").string
But the div has more than one child, so I get None.
Is there a way to go through the children and get the string from each of them?

Use .getText().
For example:
from bs4 import BeautifulSoup
sample_html = """
<div class="tv-symbol-price-quote__value js-symbol-last">122.7<span class="">8</span></div>
"""
soup = BeautifulSoup(sample_html, "html.parser")
strings = soup.find("div", class_="tv-symbol-price-quote__value js-symbol-last").getText()
print(strings)
Output:
122.78
Or call next() on the .strings generator to get only the 122.7:
soup = BeautifulSoup(sample_html, "html.parser")
# .strings yields the text nodes in document order; next() takes the first one
strings = next(soup.find("div", class_="tv-symbol-price-quote__value js-symbol-last").strings)
print(strings)
Output:
122.7

To get only the first text, search for the tag and read its next_element attribute.
from bs4 import BeautifulSoup
html = """
<div class="tv-symbol-price-quote__value js-symbol-last">122.7<span class="">8</span></div>
"""
soup = BeautifulSoup(html, "html.parser")
print(
soup.find("div", class_="tv-symbol-price-quote__value js-symbol-last").next_element
)
Output:
122.7
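
For comparison, here is a small sketch that puts the approaches above side by side on the same markup. It only uses the div from the question; the variable names are chosen for illustration.

```python
from bs4 import BeautifulSoup

html = '<div class="tv-symbol-price-quote__value js-symbol-last">122.7<span class="">8</span></div>'
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="tv-symbol-price-quote__value js-symbol-last")

# .string is None because the div has two children (a text node and a span)
whole = div.string

# get_text() concatenates every text node under the div
all_text = div.get_text()

# .strings is a generator over the text nodes; the first one is "122.7"
first_text = next(div.strings)

# Restricting the search to direct children skips the text inside the span
own_text = div.find(string=True, recursive=False)

print(whole, all_text, first_text, own_text)
```

The last variant is probably the most robust of the four if the span ever moves in front of the number, since it never looks inside child tags.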

You could use Selenium to find the element and then use BS4 to parse it.
An example would be:
import selenium.webdriver as WD
from selenium.webdriver.common.by import By
import bs4 as B
driver = WD.Chrome()
driver.get("yourpageurl")  # placeholder: load the page before searching it
# find_element(By.XPATH, ...) replaces find_element_by_xpath, which newer Selenium 4 releases removed
objXpath = driver.find_element(By.XPATH, """yourelementxpath""")
objHtml = objXpath.get_attribute("outerHTML")
soup = B.BeautifulSoup(objHtml, 'html.parser')
text = soup.get_text()
This code should work.
DISCLAIMER
I haven't worked with Selenium and bs4 in a while, so you might have to tweak it a little.

How does Beautiful Soup extract class attribute values?

I use BeautifulSoup to read a class attribute that holds several values, but the result, ['fa', 'fa-address-book-o'], is not what I want.
from bs4 import BeautifulSoup
html = "<i class='fa fa-address-book-o' aria-hidden='true'></i>"
soup = BeautifulSoup(html, "lxml")
h2 = soup.select("i")
print(h2[0]['class'])
I want the output to be:
fa fa-address-book-o

Join all the elements in your list, with a space between them:
from bs4 import BeautifulSoup
html = "<i class='fa fa-address-book-o' aria-hidden='true'></i>"
soup = BeautifulSoup(html, "lxml")
h2 = soup.select("i")
print(' '.join(h2[0]['class']))
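
The list shows up because Beautiful Soup treats class as a multi-valued attribute and splits it on whitespace, while most other attributes come back as plain strings. A minimal sketch of the difference, using the tag from the question and html.parser instead of lxml so that it has no extra dependency:

```python
from bs4 import BeautifulSoup

html = "<i class='fa fa-address-book-o' aria-hidden='true'></i>"
soup = BeautifulSoup(html, "html.parser")
icon = soup.find("i")

# class is multi-valued, so bs4 returns a list of class names
classes = icon["class"]

# aria-hidden is single-valued, so bs4 returns a plain string
hidden = icon["aria-hidden"]

# Joining with a space rebuilds the attribute as written in the HTML
joined = " ".join(classes)
print(joined)
```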

wrong python html parsing

My code:
from bs4 import BeautifulSoup
import urllib.request
url = "http://yaz.tek.firat.edu.tr/tr/duyurular"
url_oku = urllib.request.urlopen(url)
soup = BeautifulSoup(url_oku, 'html.parser')
icerik = soup.find_all('div',attrs={'class':'views-row views-row-1 views-row-odd views-row-first'})
print(icerik)
My output:
[<div class="views-row views-row-1 views-row-odd views-row-first">
<span class="views-field views-field-title"> <span class="field-content">Grup-1, Grup-2, Grup-3, Grup-4 ve Grup-6 Öğrencileri İçin Staj Sunum Tarihleri</span> </span>
<span class="views-field views-field-created"> <span class="field-content"><i class="fa fa-calendar"></i> Salı, Aralık 5, 2017 - 09:58 </span> </span> </div>]
But I want to get just " Grup-1, Grup-2, Grup-3, Grup-4 ve Grup-6 Öğrencileri İçin Staj Sunum Tarihleri ". How can I achieve that?

You can call .text on a result from BeautifulSoup. It takes the textual content of the elements found, skipping the tags of the elements.
e.g.
from bs4 import BeautifulSoup
import urllib.request
url = "http://yaz.tek.firat.edu.tr/tr/duyurular"
url_oku = urllib.request.urlopen(url)
soup = BeautifulSoup(url_oku, 'html.parser')
icerik = soup.find_all('div',attrs={'class':'views-row views-row-1 views-row-odd views-row-first'})
for result in icerik:
    print(result.text)
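
Note that .text on the whole row also includes the date line. If only the title is wanted, one option is to narrow the search to the title span first. The sketch below runs on the markup printed in the question instead of the live page, and assumes the views-field-title class marks the title, as it does in that output.

```python
from bs4 import BeautifulSoup

# Markup copied from the output shown in the question
html = """<div class="views-row views-row-1 views-row-odd views-row-first">
<span class="views-field views-field-title"> <span class="field-content">Grup-1, Grup-2, Grup-3, Grup-4 ve Grup-6 Öğrencileri İçin Staj Sunum Tarihleri</span> </span>
<span class="views-field views-field-created"> <span class="field-content"><i class="fa fa-calendar"></i> Salı, Aralık 5, 2017 - 09:58 </span> </span> </div>"""

soup = BeautifulSoup(html, "html.parser")

# class_ with a single name matches tags that have it among their classes
row = soup.find("div", class_="views-row-first")
title_span = row.find("span", class_="views-field-title")

# strip=True trims the whitespace around the title
title = title_span.get_text(strip=True)
print(title)
```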

You can also try this to get the title and link of each item on that page. I used a CSS selector to find them:
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import requests
url = "http://yaz.tek.firat.edu.tr/tr/duyurular"
res = requests.get(url)
soup = BeautifulSoup(res.text,'lxml')
for item in soup.select("#content .field-content a"):
    # resolve a relative href against the page URL
    link = urljoin(url, item['href'])
    print("Title: {}\nLink: {}\n".format(item.text, link))
Partial output:
Title: 2017-2018 Güz Dönemi Final Sınav Programı (TASLAK)
Link: http://yaz.tek.firat.edu.tr/tr/node/481
Title: NETAŞ İşyeri Eğitimi Mülakatları Hakkında Duyuru
Link: http://yaz.tek.firat.edu.tr/tr/node/480
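
The same selector and urljoin combination can be tried offline. This is only a sketch: the HTML fragment and the link titles below are invented to mirror the structure the selector expects, and the real page may differ.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = "http://yaz.tek.firat.edu.tr/tr/duyurular"

# Hypothetical fragment with relative links, for illustration only
html = """
<div id="content">
  <span class="field-content"><a href="/tr/node/481">First notice</a></span>
  <span class="field-content"><a href="/tr/node/480">Second notice</a></span>
  <span class="field-content">No link here</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# a[href] skips anchors without an href, so item["href"] cannot raise KeyError
links = [
    (item.get_text(strip=True), urljoin(base_url, item["href"]))
    for item in soup.select("#content .field-content a[href]")
]

for title, link in links:
    print("Title: {}\nLink: {}\n".format(title, link))
```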

how to extract an attribute value of div using BeautifulSoup

I have a div whose id is "img-cont"
<div class="img-cont-box" id="img-cont" style='background-image: url("http://example.com/example.jpg");'>
I want to extract the URL in background-image using Beautiful Soup. How can I do it?

You can use find_all, or find for the first match.
import re
from bs4 import BeautifulSoup
# html_str is assumed to hold the page's HTML
soup = BeautifulSoup(html_str, 'html.parser')
result = soup.find('div', attrs={'id': 'img-cont', 'style': True})
if result is not None:
    url = re.findall(r'\("(http.*)"\)', result['style'])  # returns a list

Try this:
import re
from bs4 import BeautifulSoup
html = '''\
<div class="img-cont-box" \
id="img-cont" \
style='background-image: url("http://example.com/example.jpg");'>\
'''
soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div', id='img-cont')
print(re.search(r'url\("(.+)"\)', div['style']).group(1))
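
Both answers assume the div exists and that the URL is wrapped in double quotes. As a slightly more defensive sketch, the helper below returns None when the div or its style is missing and also accepts single-quoted or unquoted url(...) values. background_url and URL_IN_STYLE are hypothetical names introduced here, not part of bs4.

```python
import re
from bs4 import BeautifulSoup

# Matches url(...), with optional single or double quotes around the value
URL_IN_STYLE = re.compile(r"""url\(\s*['"]?(.*?)['"]?\s*\)""")


def background_url(html, div_id):
    """Return the url(...) value from the style of the div with this id, or None."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", id=div_id)
    if div is None or not div.has_attr("style"):
        return None
    match = URL_IN_STYLE.search(div["style"])
    return match.group(1) if match else None


html = (
    '<div class="img-cont-box" id="img-cont" '
    "style='background-image: url(\"http://example.com/example.jpg\");'></div>"
)
print(background_url(html, "img-cont"))
```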

Web crawling using python beautifulsoup

How can I extract the data inside the <p> and <li> tags that sit under a <div> with a given class?

Use the functions find() and find_all():
import requests
from bs4 import BeautifulSoup
url = '...'
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'html.parser')
div = soup.find('div', {'class':'class-name'})
ps = div.find_all('p')
lis = div.find_all('li')
# print the content of all <p> tags
for p in ps:
    print(p.text)
# print the content of all <li> tags
for li in lis:
    print(li.text)
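
To try this without a network request, the same find() and find_all() calls can be run on an inline fragment. The HTML below and the class-name value are placeholders, and the None check is an addition for pages where the div is absent.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one <p> and two <li> inside the div, one <p> outside it
html = """
<div class="class-name">
  <p>First paragraph</p>
  <ul><li>Item A</li><li>Item B</li></ul>
</div>
<p>Outside the div</p>
"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="class-name")

# find() returns None when nothing matches, so guard before searching inside it
paragraphs = [p.get_text(strip=True) for p in div.find_all("p")] if div else []
items = [li.get_text(strip=True) for li in div.find_all("li")] if div else []

print(paragraphs)
print(items)
```

Because find_all is called on div and not on soup, the paragraph outside the div is not picked up.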
