Python Selenium get text out of a class - python

Hey, I want to get the text inside this class and print it to the console.
How can I do this?
I have tried:
profileName = driver.find_element_by_class_name("._7UhW9.fKFbl.yUEEX.KV-D4.fDxYl").text()
print(profileName)
I would be very happy if you could help me.
Tom Rudolph

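In Python it is not text(), it is .text (a property, not a method).
Also, ._7UhW9.fKFbl.yUEEX.KV-D4.fDxYl looks dynamically generated, and it is a CSS selector, not a class name, so use find_element_by_css_selector instead of find_element_by_class_name.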
Code 1 :
profileName = driver.find_element_by_css_selector("._7UhW9.fKFbl.yUEEX.KV-D4.fDxYl").text
print(profileName)
Code 2 :
profileName = driver.find_element_by_css_selector("._7UhW9.fKFbl.yUEEX.KV-D4.fDxYl").get_attribute('innerHTML')
print(profileName)

If that element has all these class names, you just have to change find_element_by_class_name to find_element_by_css_selector and use .text without parentheses, as follows:
profileName = driver.find_element_by_css_selector("._7UhW9.fKFbl.yUEEX.KV-D4.fDxYl").text
print(profileName)
However, I'm fairly sure these class names are generated dynamically, so this selector may stop working when they change.
Also, make sure you add a delay or an explicit wait before accessing the element.

Maybe try this, using the BeautifulSoup library.
From the beginning:
import requests
from bs4 import BeautifulSoup
url = 'my_site'
r = requests.get(url)
re1 = BeautifulSoup(r.text, "lxml")
# select() takes a CSS selector, so the dotted class chain works here
re2 = re1.body.select('div._7UhW9.fKFbl.yUEEX.KV-D4.fDxYl')
for syntax in re2:
    print(syntax)
Where it says div, you can put whichever other tag carries this class (a, p, h).
In Selenium, use the find_elements_* methods instead of find_element_* when harvesting data; it is better to collect everything :D
But when you want to click an element or type something (click() or send_keys()), you have to use the singular form, find_element_*, not find_elements_*.
I do this with Selenium WebDriver:
data1 = driver.find_elements_by_class_name('class')
data2 = driver.find_elements_by_css_selector('css_selector')
if len(data1) > 0:
    pass  # do something with data1
if len(data2) > 0:
    pass  # do something with data2
You can always add an else branch to handle the case where nothing matched, or use print(len(data1)) or print(len(data2)) to see how many elements you have collected.
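To show the difference between a CSS selector and a class name in BeautifulSoup, here is a small sketch. The HTML string is made up for illustration and only stands in for the real page, which may be rendered by JavaScript and not be reachable with requests at all.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
html = '<div class="_7UhW9 fKFbl yUEEX">Tom</div><div class="fKFbl">other</div>'
soup = BeautifulSoup(html, "html.parser")

# CSS selector: dots chain class names, so all three classes are required.
by_selector = [el.get_text(strip=True) for el in soup.select("div._7UhW9.fKFbl.yUEEX")]

# class_ matches a single class name and takes no leading dot.
by_class = [el.get_text(strip=True) for el in soup.find_all("div", class_="fKFbl")]

# Passing the dotted selector string as a class value matches nothing.
by_dotted = soup.find_all("div", {"class": "._7UhW9.fKFbl.yUEEX"})

print(by_selector, by_class, len(by_dotted))
```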

Related

How do you get a text from a span tag using BeautifulSoup when there's no clear identification?

I am trying to extract the value from this span tag for Year Built using BeautifulSoup and the code below, but I'm getting the label instead of the actual year. Please help. Thanks :)
results = []
for url in All_product[:2]:
    link = url
    html = getAndParseURL(url)
    YearBuilt = html.findAll("span", {"class": "header font-color-gray-light inline-block"})[4]
    results.append([YearBuilt])
The output shows
[[<span class="header font-color-gray-light inline-block">Year Built</span>],
[<span class="header font-color-gray-light inline-block">Community</span>]]
Try using .next_sibling:
result = []
year_built = html.find_all(
    "span", {"class": "header font-color-gray-light inline-block"}
)
for elem in year_built:
    if elem.text.strip() == 'Year Built':
        result.append(elem.next_sibling)
I'm not sure how the whole HTML looks, but something along these lines might help.
Note: There is probably a more specific solution for extracting all the attributes you need, but for that you should improve your question and add more details.
Using CSS selectors, you can chain and combine selections to be more strict. In this case, select the <span> that contains your string and use the adjacent sibling combinator to get the next sibling <span>.
YearBuilt = e.text if (e := html.select_one('span.header:-soup-contains("Year Built") + span')) else None
This also avoids AttributeError: 'NoneType' object has no attribute 'text', because the conditional expression checks that the element exists before accessing its text.
soup = BeautifulSoup(html_doc, "html.parser")
results = []
for url in All_product[:2]:
    link = url
    html = getAndParseURL(url)
    YearBuilt = e.text if (e := html.select_one('span.header:-soup-contains("Year Built") + span')) else None
    results.append([YearBuilt])
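Here is a self-contained sketch of the selector approach on made-up markup. The HTML, the value_for helper and the "content" class are assumptions for illustration; the real page may be structured differently. It also assumes a soupsieve version recent enough to know :-soup-contains.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: each label span is directly followed by a value span.
html = """
<div><span class="header font-color-gray-light inline-block">Year Built</span><span class="content">1998</span></div>
<div><span class="header font-color-gray-light inline-block">Community</span><span class="content">Lakeside</span></div>
"""
soup = BeautifulSoup(html, "html.parser")


def value_for(soup, label):
    """Return the text of the span right after the header span containing label, or None."""
    el = soup.select_one(f'span.header:-soup-contains("{label}") + span')
    return el.get_text(strip=True) if el else None


print(value_for(soup, "Year Built"))
print(value_for(soup, "Garage"))
```

The adjacent sibling combinator (+) only looks at element siblings, so whitespace between the two spans does not get in the way, which can happen with .next_sibling.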

Is there a way to use select_one in bs4 with an "or" function?

So basically I want to fetch data, more specifically extract some text, from a website, but the problem is that the element I want to find changes location. I'm sorry if I explain this badly; I just started learning Python.
from bs4 import BeautifulSoup
import requests
url_patra = ("https://weather.com/el-GR/weather/today/l/a8c1d5fa8f854f3e5c626109483f1542b6eb8f29924330ccc44ffc07e3050bd7")
html_patra = BeautifulSoup(requests.get(url_patra).content, 'html.parser')
patra_prediction = html_patra.select_one("div[class*=CurrentConditions--phraseValue--2Z18W]").text
print (patra_prediction)
My problem is that sometimes it works with :
patra_prediction = html_patra.select_one("div[class*=CurrentConditions--phraseValue--2Z18W]").text
and sometimes with :
patra_prediction = html_patra.select_one("div[class*=CurrentConditions--precipValue--3nxCj]").text
I can't change this specific line every time. So my final question is: is there a way to use an "or" function or something similar, so that when the first line doesn't find the desired .text it uses the second line?
Here is a script that extracts what you need from that page: location, temperature, weather type, and so on.
from bs4 import BeautifulSoup
import requests
page_content = requests.get("https://weather.com/el-GR/weather/today/l/a8c1d5fa8f854f3e5c626109483f1542b6eb8f29924330ccc44ffc07e3050bd7").content
soup = BeautifulSoup(page_content, 'html.parser')
def find_location():
    locator = 'div.CurrentConditions--header--27uOE h1.CurrentConditions--location--kyTeL' # CSS locator
    item = soup.select_one(locator).string
    print(item)
def find_weather():
    locator = 'div.CurrentConditions--primary--2SVPh span.CurrentConditions--tempValue--3a50n'
    weather = soup.select_one(locator).string
    print(weather)
def find_weather_type():
    locator = 'div.CurrentConditions--phraseValue--2Z18W'
    type_of = soup.select_one(locator).string
    print(type_of)
def info():
    locator = "div.CurrentConditions--precipValue--3nxCj span"
    info_text = soup.select_one(locator).string
    print(info_text)
def as_of_time():
    locator = "div.CurrentConditions--timestamp--23dfw"
    as_of = soup.select_one(locator).string
    print(as_of)
find_location()
find_weather()
find_weather_type()
info()
as_of_time()
There is no "or" function.
According to the page source, this is the current location.
Most of the time the name of the location should be there.
If it is missing, check for a null value (None in Python) while doing the extraction.
element = html_patra.select_one("div[class*=CurrentConditions--phraseValue--2Z18W]")
if element is None:
    loc = "No location"
else:
    loc = element.text
I can't change this specific line every time. So my final question is:
is there a way to use an "or" function or something similar, so that when
the first line doesn't find the desired .text it uses the second line?
Good to know
[attr*=value] represents elements whose attr attribute value contains the substring value, so there is no real need to handle it conditionally.
How to fix?
You are close to a solution. Just drop the dynamically generated part of the class:
html_patra.select_one("div[class*=CurrentConditions--phraseValue]").text
Note: As an alternative, try selecting via a different attribute that will not change as frequently: html_patra.select_one('div[data-testid="wxPhrase"]').text
Example
from bs4 import BeautifulSoup
import requests
url_patra = ("https://weather.com/el-GR/weather/today/l/a8c1d5fa8f854f3e5c626109483f1542b6eb8f29924330ccc44ffc07e3050bd7")
html_patra = BeautifulSoup(requests.get(url_patra).content, 'html.parser')
patra_prediction = html_patra.select_one("div[class*=CurrentConditions--phraseValue]").text
print (patra_prediction)
Output
Νεφελώδης
Using the try: and except: approach
This tests a block of code for errors. Use it only if you are 100% sure that these two class names are the only ones and that they never change.
Example:
try:
    patra_prediction = html_patra.select_one("div[class*=CurrentConditions--phraseValue--2Z18W]").text
except AttributeError:  # select_one returned None
    patra_prediction = html_patra.select_one("div[class*=CurrentConditions--precipValue--3nxCj]").text
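The fallback idea can also be written without exceptions. Below is a small sketch on made-up markup; the first_text helper and the sample HTML are my own assumptions, and the class prefixes are taken from the question without the generated suffix.

```python
from bs4 import BeautifulSoup


def first_text(soup, selectors):
    """Return the stripped text of the first selector that matches, else None."""
    for selector in selectors:
        el = soup.select_one(selector)
        if el is not None:
            return el.get_text(strip=True)
    return None


# Hypothetical page where only the second pattern is present.
html = '<div class="CurrentConditions--precipValue--3nxCj"><span>Rain likely</span></div>'
soup = BeautifulSoup(html, "html.parser")
selectors = [
    'div[class*="CurrentConditions--phraseValue"]',
    'div[class*="CurrentConditions--precipValue"]',
]
prediction = first_text(soup, selectors)

# Alternative: a comma-separated selector list matches either pattern and
# returns whichever matching element comes first in the document.
either = soup.select_one(", ".join(selectors))

print(prediction)
```

The loop tries the selectors in the order you list them, while the comma-separated form follows document order, so the two can differ when both elements are present.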

How to get the text under the tag

I'm trying to get the text under the tag
I tried several different options:
dneyot=driver.find_elements_by_xpath("//*[starts-with(@id, 'popover-')]/text()")
dneyot=driver.find_elements_by_xpath("//*[starts-with(@id, 'popover-')]/b[1]/text()")
my piece of code:
dneyot=driver.find_elements_by_xpath("//*[starts-with(@id, 'popover-')]/text()")
for spisok in dneyot:
    print("Display period >3 days", spisok.text)
UPD:
I find the items I need in the browser using:
//*[starts-with(@id, 'popover-')]/text()[1]
but I get an error:
selenium.common.exceptions.InvalidSelectorException:
Message: invalid selector: The result of the xpath expression "//*[starts-with(@id, 'popover-')]/text()[1]" is: [object Text]. It should be an element.
If you want to get that text excluding the <b> node text, then you need to use the XPath below:
//div[starts-with(@id, 'popover-')]
This identifies the div node; with the find_elements_by_xpath() method you can then retrieve all the text from the div node. Try the code below:
elements = driver.find_elements_by_xpath("//div[starts-with(@id, 'popover-')]")
for element in elements:
    print(element.text)
Update:
I suspect the method above may not work, and we may not be able to get that data using the normal methods. In that case you need to run JavaScript through execute_script to get the data, like below:
from selenium import webdriver
driver = webdriver.Chrome('chromedriver.exe')
driver.get("file:///C:/NotBackedUp/SomeHTML.html")
xPath = "//div[starts-with(@id, 'popover-')]"
elements = driver.find_elements_by_xpath(xPath)
for element in elements:
    length = int(driver.execute_script("return arguments[0].childNodes.length;", element))
    # childNodes is zero-indexed, so valid indexes are 0 .. length - 1
    for i in range(length):
        try:
            data = str(driver.execute_script("return arguments[0].childNodes[arguments[1]].textContent;", element, i)).strip()
            if data:
                print(data)
        except Exception:
            print("=> Can't print some data...")
As your site is written in a language other than English, you may not be able to print some of the data.
To get the data of specific child nodes, do the following:
from selenium import webdriver
driver = webdriver.Chrome('chromedriver.exe')
driver.get("file:///C:/NotBackedUp/SomeHTML.html")
xPath = "//div[starts-with(@id, 'popover-')]"
elements = driver.find_elements_by_xpath(xPath)
for element in elements:
    # Print the first <b> text
    b1Text = driver.execute_script("return arguments[0].childNodes[2].textContent", element)
    print(b1Text)
    # Print the second <b> text
    b2Text = driver.execute_script("return arguments[0].childNodes[6].textContent", element)
    print(b2Text)
print("=> Done...")
I hope it helps...
Using Beautifulsoup:
Find the div with the id = popover-34252127 inside the parent div.
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.your_url_here.com/")
soup = BeautifulSoup(page.content, 'html.parser')
data = soup.find("div", {"id": "popover-34252127"})
print(data)
find_elements_by_xpath() returns a list of WebElement objects, the basic object Selenium actually works with.
Your XPath ends with /text(), which returns the text content of a node in an XML document, not the element object Selenium expects. So just drop that suffix, which returns the element itself, and get the element's text with .text in Python:
dneyot=driver.find_elements_by_xpath("//*[starts-with(@id, 'popover-')]")
for element in dneyot:
    print("Display period >3 days", element.text)
text() returns a text node, which Selenium doesn't know how to handle; it can only handle WebElements. You need to get the element whose id starts with "popover-" and work with the text it returns:
elements = driver.find_elements_by_xpath("//*[starts-with(@id, 'popover-')]")
for element in elements:
    lines = element.text.split('\n')
    for line in lines:
        print("Display period >3 days", line)
You can use a regular expression to get the dates:
import re
#...
rePeriod = r'(.*)(\d{4}-\d{2}-\d{2} - \d{4}-\d{2}-\d{2})(.*)'
dneyot = driver.find_elements_by_css_selector('div[id^="popover-"]')
for spisok in dneyot:
    m = re.search(rePeriod, spisok.text)
    if m:  # skip elements without a date range
        print("Display period >3 days", m.group(2))
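The regular expression part can be tried without a browser. This is only a sketch: the sample string is invented, and it assumes the popover text contains a range in the form YYYY-MM-DD - YYYY-MM-DD, as the pattern above suggests.

```python
import re

# Matches a range such as "2019-01-05 - 2019-01-09".
PERIOD = re.compile(r"\d{4}-\d{2}-\d{2} - \d{4}-\d{2}-\d{2}")


def extract_period(text):
    """Return the first date range found in text, or None if there is none."""
    match = PERIOD.search(text)
    return match.group(0) if match else None


# Hypothetical element text, standing in for element.text from Selenium.
sample = "Display period: 2019-01-05 - 2019-01-09\nImpressions: 12"
print(extract_period(sample))
```

Returning None for text without a range avoids the AttributeError you would get from calling .group() on a failed match.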

Finding specific tag using BeautifulSoup

Here is the website that I'm parsing: http://uniapple.net/usaddress/address.php?address1=501+10th+ave&address2=&city=nyc&state=ny&zipcode=10036&country=US
I would like to find the word on line 39, between the td tags. That line tells me if the address is residential or commercial, which is what I need for my script.
Here's what I have, but I'm getting this error:
AttributeError: 'NoneType' object has no attribute 'find_next'
The code I'm using is:
from bs4 import BeautifulSoup
import urllib.request  # Python 3; in Python 2 this was urllib.urlopen
page = "http://uniapple.net/usaddress/address.php?address1=501+10th+ave&address2=&city=nyc&state=ny&zipcode=10036&country=US"
z = urllib.request.urlopen(page).read()
thesoup = BeautifulSoup(z, "html.parser")
comres = (thesoup.find("th",text=" Residential or ").find_next("td").text)
print(str(comres))
The text argument will not work in this particular case. This is related to how the .string property of an element is calculated. Instead, I would use a search function that calls get_text() and checks the complete text of an element, including its child nodes:
label = thesoup.find(lambda tag: tag and tag.name == "th" and \
    "Residential" in tag.get_text())
comres = label.find_next("td").get_text()
print(str(comres))
Prints Commercial.
We can go a little bit further and make a reusable function to get a value by label:
soup = BeautifulSoup(z, "html.parser")
def get_value_by_label(soup, label):
    # Use a separate name for the tag so it does not shadow the label string
    th = soup.find(lambda tag: tag and tag.name == "th" and label in tag.get_text())
    return th.find_next("td").get_text(strip=True)
print(get_value_by_label(soup, "Residential"))
print(get_value_by_label(soup, "City"))
Prints:
Commercial
NYC
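To make the .string point concrete, here is a small sketch on invented markup. The real page's table is not shown in the question, so the <br/> inside the <th> is only an assumption about why a plain string filter finds nothing.

```python
from bs4 import BeautifulSoup

# Hypothetical row: the header cell contains text, a <br/>, and more text.
html = "<table><tr><th>Residential or <br/>Commercial</th><td>Commercial</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")
th = soup.find("th")

# .string is None because <th> has more than one child node,
# so a string filter on the tag has nothing to compare against.
th_string = th.string

# get_text() joins the text of all descendants.
th_text = th.get_text()

label = soup.find(lambda tag: tag.name == "th" and "Residential" in tag.get_text())
value = label.find_next("td").get_text(strip=True)

print(th_string, repr(th_text), value)
```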
All you are missing is a bit of housekeeping:
ths = thesoup.find_all("th")
for th in ths:
    if 'Residential or' in th.text:
        comres = th.find_next("td").text
        print(str(comres))
>> Commercial
You'll need to use a regular expression as your text argument, like re.compile('Residential or'), rather than a string.
This worked for me. I had to iterate over the results, though if you only expect a single result per page you could swap find_all for find:
import re
for r in thesoup.find_all(text=re.compile('Residential or')):
    print(r.find_next('td').text)

Extending selection with BeautifulSoup

I am trying to get BeautifulSoup to do the following.
I have HTML files which I wish to modify. I am interested in two tags in particular, one which I will call TagA is
<div class ="A">...</div>
and one which I will call TagB
<p class = "B">...</p>
Both tags occur independently throughout the HTML and may themselves contain other tags and be nested inside other tags.
I want to place a marker tag around every TagA whenever it is not immediately followed by TagB, so that
<div class="A">...</div> becomes <marker><div class="A">...</div></marker>
But when TagA is followed immediately by TagB, I want the marker tag to surround them both,
so that
<div class="A">...</div><p class="B">...</p>
becomes
<marker><div class="A">...</div><p class="B">...</p></marker>
I can see how to select TagA and enclose it with the marker tag, but when it is followed by TagB I do not know if or how the BeautifulSoup 'selection' can be extended to include the next sibling.
Any help appreciated.
BeautifulSoup does have a "next sibling" attribute. Find all tags of class A and use a.next_sibling to check whether it is B.
Look at the docs:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#going-sideways
I think I was going about this the wrong way by trying to extend the 'selection' from one tag to the following one. Instead I found that the following code, which inserts the outer 'Marker' tag and then moves the A and B tags into it, does the trick.
I am pretty new to Python, so I would appreciate advice regarding improvements or snags with the following.
def isTagB(tag):
    # Return True if tag is <p class="B">;
    # return False otherwise, including when tag is just a string or None.
    try:
        return tag.name == 'p' and 'B' in tag.get('class', [])
    except AttributeError:
        return False
from bs4 import BeautifulSoup
soup = BeautifulSoup("""<div class = "A"><p><i>more content</i></p></div><div class = "A"><p><i>hello content</i></p></div><p class="B">da <i>de</i> da </p><div class = "fred">not content</div>""", "html.parser")
for TagA in soup.find_all("div", "A"):
    Marker = soup.new_tag('Marker')
    nexttag = TagA.next_sibling
    # skip over whitespace
    while str(nexttag).isspace():
        nexttag = nexttag.next_sibling
    TagA.replace_with(Marker)  # put the marker where the A element is
    Marker.insert(0, TagA)
    if isTagB(nexttag):
        Marker.insert(1, nexttag)  # move the B element in after A
print(soup)
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://ursite.com")  # gives the HTML response
soup = BeautifulSoup(html, "html.parser")
all_div = soup.find_all("div", attrs={})  # use attrs as a dict for attribute filtering
# e.g. attrs={'class': "class", "id": "1234"}
single_div = all_div[0]
# to find the p tag inside single_div
p_tag_obj = single_div.find("p")
You can use obj.find_next(), obj.find_all_next(), obj.find_all_previous(), obj.find_previous().
To get an attribute you can use obj.get("href"), obj.get("title"), etc.
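For the marker question itself, bs4's wrap() can shorten the replace-then-insert steps. The following is a sketch on invented markup, not the asker's files; the helper names is_tag_b and next_non_blank are mine.

```python
from bs4 import BeautifulSoup, Tag


def is_tag_b(node):
    """True if node is a <p> tag whose class list contains "B"."""
    return isinstance(node, Tag) and node.name == "p" and "B" in node.get("class", [])


def next_non_blank(node):
    """Return the next sibling that is not whitespace-only text (or None)."""
    sibling = node.next_sibling
    while sibling is not None and not isinstance(sibling, Tag) and not str(sibling).strip():
        sibling = sibling.next_sibling
    return sibling


# Hypothetical input: the second A is followed by whitespace and then a B.
html = '<div class="A">one</div><div class="A">two</div> <p class="B">three</p><div class="fred">four</div>'
soup = BeautifulSoup(html, "html.parser")

for tag_a in soup.find_all("div", class_="A"):
    follower = next_non_blank(tag_a)
    marker = tag_a.wrap(soup.new_tag("marker"))  # wrap() returns the new wrapper
    if is_tag_b(follower):
        marker.append(follower)  # moves the B tag inside the marker

markers = soup.find_all("marker")
print(soup)
```

find_all() collects the A tags before the loop starts, so changing the tree inside the loop does not disturb the iteration.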
