I want to find "Moderat" in <p class="text-spread-level">Moderat</p>
I have tried with id, name, xpath and link text.
Would you like to try this?
from bs4 import BeautifulSoup
import requests
sentences = []
res = requests.get(url) # assign your url in variable
soup = BeautifulSoup(res.text, "lxml")
tag_list = soup.select("p.text-spread-level")
for tag in tag_list:
sentences.append(tag.text)
print(sentences)
Find the element by class name and get the text.
el=driver.find_element_by_class_name('text-spread-level')
val=el.text
print(val)
Related
I'm trying to get the number from inside a div:
<div class="tv-symbol-price-quote__value js-symbol-last">122.7<span class="">8</span></div>
I need the 122.7 number, but I cant get it. I have tried with:
strings = soup.find("div", class_="tv-symbol-price-quote__value js-symbol-last").string
But, there are more than one element and I receive "none".
Is there a way to print the childs and get the string from childs?
Use .getText().
For example:
from bs4 import BeautifulSoup
sample_html = """
<div class="tv-symbol-price-quote__value js-symbol-last">122.7<span class="">8</span></div>
"""
soup = BeautifulSoup(sample_html, "html.parser")
strings = soup.find("div", class_="tv-symbol-price-quote__value js-symbol-last").getText()
print(strings)
Output:
122.78
Or use __next__() to get only the 122.7.
soup = BeautifulSoup(sample_html, "html.parser")
strings = soup.find("div", class_="tv-symbol-price-quote__value js-symbol-last").strings.__next__()
print(strings)
Output:
122.7
To only get the first text, search for the tag, and call the next_element method.
from bs4 import BeautifulSoup
html = """
<div class="tv-symbol-price-quote__value js-symbol-last">122.7<span class="">8</span></div>
"""
soup = BeautifulSoup(html, "html.parser")
print(
soup.find("div", class_="tv-symbol-price-quote__value js-symbol-last").next_element
)
Output:
122.7
You could use selenium to find the element and then use BS4 to parse it.
An example would be
import selenium.webdriver as WD
from selenium.webdrive.chrome.options import Options
import bs4 as B
driver = WD.Chrome()
objXpath = driver.find_element_by_xpath("""yourelementxpath""")
objHtml = objXpath.get_attribute("outerHTML")
soup = B.BeutifulSoup(objHtml, 'html.parser')
text = soup.get_text()
This code should work.
DISCLAIMER
I haven't done work w/ selenium and bs4 in a while so you might have to tweak it a little bit.
Hello this is the code I want to grab first link from using BeautifulSoup.
view-source:https://www.binance.com/en/blog
I want to grab the first article here so it would be "Trust Wallet Now Supports Stellar Lumens, 4 More Tokens"
I am trying to use Python for this.
I use this code but it grabs all the links, I only want first one to grab
with open('binanceblog1.html', 'w') as article:
before13 = requests.get("https://www.binance.com/en/blog", headers=headers2)
data1b = before13.text
xsoup2 = BeautifulSoup(data1b, "lxml")
for div in xsoup2.findAll('div', attrs={'class':'title sc-0 iaymVT'}):
before_set13 = div.find('a')['href']
How can I do this?
Most simple solution I can think at this moment that works with your code is to use break, this is because of findAll
for div in xsoup2.findAll('div', attrs={'class':'title sc-62mpio-0 iIymVT'}):
before_set13 = div.find('a')['href']
break
For just the first element you can use find
before_set13 = soup.find('div', attrs={'class':'title sc-62mpio-0 iIymVT'}).find('a')['href']
Try (Extracting the href from 'Read more' button)
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.binance.com/en/blog')
soup = BeautifulSoup(r.text, "html.parser")
div = soup.find('div', attrs={'class': 'read-btn sc-62mpio-0 iIymVT'})
print(div.find('a')['href'])
You can assess the situation inside the loop and break when you find a satisfactory result.
for div in xsoup2.findAll('div', attrs={'class':'title sc-62mpio-0 iIymVT'}):
before_set13 = div.find('a')['href']
if before_set13 != '/en/blog':
break
print('skipping ' + before_set13)
print('grab ' + before_set13)
Output of the code with these changes:
skipping /en/blog
grab /en/blog/317619349105270784/Trust-Wallet-Now-Supports-Stellar-Lumens-4-More-Tokens
Use the class name with class css selector (.) for the content section then descendant combinator with a type css selector to specify child a tag element. select_one returns first match
soup.select_one('.content a')['href']
Code:
from bs4 import BeautifulSoup as bs
import requests
r = requests.get('https://www.binance.com/en/blog')
soup = bs(r.content, 'lxml')
link = soup.select_one('.content a')['href']
print('https://www.binance.com' + link)
I'd like to scrape a site to findall title attributes of h2 tag
<h2 class="1">Titanic_Caprio</h2>
Using this code, I'm accessing the entire h2 tag
from bs4 import BeautifulSoup
import urllib2
url = "http://www.example.it"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
links = soup.findAll('h2')
print "".join([str(x) for x in links] )
using findAll('h2', attrs = {'title'}) doesn't have results. What Am I doing wrong? How can I print out the entire title's list in a file?
The problem is that title is not an attribute of the h2 tag, but of a tag included in it. So you must first search for <h2> tags, and then subtags having a title attribute:
titles = []
h2_list = links = soup.findAll('h2')
for h2 in h2_list:
titles.extend(h2.findAll(lambda x: x.has_attr('title')))
It works because BeautifulSoup can use functions as search filters.
you need to pass key value pairs in attrs
findAll('h2', attrs = {"key":"value"})
I've been trying to extract just the links corresponding to the jobs on each page. But for some reason they dont print when I execute the script. No errors occur.
for the inputs I put engineering, toronto respectively. Here is my code.
import requests
from bs4 import BeautifulSoup
import webbrowser
jobsearch = input("What type of job?: ")
location = input("What is your location: ")
url = ("https://ca.indeed.com/jobs?q=" + jobsearch + "&l=" + location)
r = requests.get(url)
rcontent = r.content
prettify = BeautifulSoup(rcontent, "html.parser")
all_job_url = []
for tag in prettify.find_all('div', {'data-tn-element':"jobTitle"}):
for links in tag.find_all('a'):
print (links['href'])
You should be looking for the anchor a tag. It looks like this:
<a class="turnstileLink" data-tn-element="jobTitle" href="/rc/clk?jk=3611ac98c0167102&fccid=459dce363200e1be" ...>Project <b>Engineer</b></a>
Call soup.find_all and iterate over the result set, extracting the links through the href attribute.
import requests
from bs4 import BeautifulSoup
# valid query, replace with something else
url = "https://ca.indeed.com/jobs?q=engineer&l=Calgary%2C+AB"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
all_job_url = []
for tag in soup.find_all('a', {'data-tn-element':"jobTitle"}):
all_job_url.append(tag['href'])
Thanks for attention!
I'm trying to retrieve the href of products in search result.
For example this page:
However When I narrow down to the product image class, the retrived href are image links....
Can anyone solve that? Thanks in advance!
url = 'http://www.homedepot.com/b/Husky/N-5yc1vZrd/Ntk-All/Ntt-chest%2Band%2Bcabinet?Ntx=mode+matchall&NCNI-5'
content = urllib2.urlopen(url).read()
content = preprocess_yelp_page(content)
soup = BeautifulSoup(content)
content = soup.findAll('div',{'class':'content dynamic'})
draft = str(content)
soup = BeautifulSoup(draft)
items = soup.findAll('div',{'class':'cell_section1'})
draft = str(items)
soup = BeautifulSoup(draft)
content = soup.findAll('div',{'class':'product-image'})
draft = str(content)
soup = BeautifulSoup(draft)
You don't need to load the content of each found tag with BeautifulSoup over and over again.
Use CSS selectors to get all product links (a tag under a div with class="product-image")
import urllib2
from bs4 import BeautifulSoup
url = 'http://www.homedepot.com/b/Husky/N-5yc1vZrd/Ntk-All/Ntt-chest%2Band%2Bcabinet?Ntx=mode+matchall&NCNI-5'
soup = BeautifulSoup(urllib2.urlopen(url))
for link in soup.select('div.product-image > a:nth-of-type(1)'):
print link.get('href')
Prints:
http://www.homedepot.com/p/Husky-41-in-16-Drawer-Tool-Chest-and-Cabinet-Set-HOTC4016B1QES/205080371
http://www.homedepot.com/p/Husky-26-in-6-Drawer-Chest-and-Cabinet-Combo-Black-C-296BF16/203420937
http://www.homedepot.com/p/Husky-52-in-18-Drawer-Tool-Chest-and-Cabinet-Set-Black-HOTC5218B1QES/204825971
http://www.homedepot.com/p/Husky-26-in-4-Drawer-All-Black-Tool-Cabinet-H4TR2R/204648170
...
div.product-image > a:nth-of-type(1) CSS selector would match every first a tag directly under the div with class product-image.
To save the links into a list, use a list comprehension:
links = [link.get('href') for link in soup.select('div.product-image > a:nth-of-type(1)')]