Detecting the presence of text with BeautifulSoup - Python

I'm trying to check for the presence of text in a certain zone of a page (if you have already sent a message, the text appears in this zone; otherwise it's blank).
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen(single_link)
parsed = BeautifulSoup(html, 'html.parser')
lastmessages = parsed.find('div', attrs={'id': 'message-box'})
if lastmessages:
    print('Already Reached')
else:
    print('you can write your message')
The HTML that appears in the message zone looks like this:
<div class="lastMessage">
<div class="mine messages">
<div class="message last" id="msg-snd-1601248710299" dir="auto">
Hello, and how are you ?
</div>
<div style="clear : both ;"></div>
<div class="msg-status" id="msg-status-1601248710299" dir="rtl">
<div class="send-state">
Last message :
<span class="r2">before 7:35 </span>
</div>
<div class="read-state">
<span style="color : gray ;"> – </span>
Reading :
<span class="r2">Not yet</span>
</div>
</div>
<div style="clear : both ;"></div>
</div>
</div>
My problem is that I don't know how to check whether the text "Hello, and how are you ?" exists or not.

Simple solution
import bs4

parsed = bs4.BeautifulSoup(html, 'html.parser')
lastmessages = parsed.find('div', class_='message last')
if lastmessages:
    print(lastmessages.text.strip())
else:
    print('No message')
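If you specifically need to test for that exact sentence rather than just the presence of the div, a minimal variant (same `parsed` object as above; bs4's string= filter accepts a function) would be:
target = 'Hello, and how are you ?'
if parsed.find(string=lambda s: target in s):
    # The sentence occurs somewhere in the page's text nodes.
    print('Already Reached')
else:
    print('you can write your message')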

Related

Confirm preceding and following sibling in XPATH

I've got the statement below to check that two conditions exist in an element:
if len(driver.find_elements(By.XPATH, "//span[text()='$400.00']/../following-sibling::div/a[text()='Buy']")) > 0:
    elem = driver.find_element(By.XPATH, "//span[text()='$400.00']/../following-sibling::div/a[text()='Buy']")
I've tried a few variations, including "preceding-sibling::span[text()='x']", but I can't seem to get the syntax right, or maybe I'm not going about it the right way.
The HTML is below. The current find_elements(By.XPATH...) correctly finds the "Total" and "Buy" elements; I would also like to add $20.00 in the "Price" class as a condition.
<ul>
<li class="List">
<div class="List-Content row">
<div class="Price">"$20.00"</div>
<div class="Quantity">10</div>
<div class="Change">0%</div>
<div class="Total">
<span>$400.00</span>
</div>
<div class="Buy">
<a class="Button">Buy</a>
</div>
</div>
</li>
</ul>
Using the built-in ElementTree:
import xml.etree.ElementTree as ET

html = '''<li class="List">
<div class="List-Content row">
<div class="Price">"$20.00"</div>
<div class="Quantity">10</div>
<div class="Change">0%</div>
<div class="Total"><span>$400.00</span></div>
<div class="Buy"><a class="Button">Buy</a></div>
</div>
<div class="List-Content row">
<div class="Price">"$27.00"</div>
<div class="Quantity">10</div>
<div class="Change">0%</div>
<div class="Total"><span>$400.00</span></div>
<div class="Buy"><a class="Button">Buy</a></div>
</div>
</li>'''

items = {'Total': '$400.00', 'Buy': 'Buy', 'Price': '"$20.00"'}
root = ET.fromstring(html)
first_level_divs = root.findall('div')
for first_level_div in first_level_divs:
    results = {}
    for k, v in items.items():
        # ElementTree's limited XPath: direct child div whose class is exactly k
        div = first_level_div.find(f'./div[@class="{k}"]')
        one_level_down = len(list(div)) > 0
        results[k] = list(div)[0].text if one_level_down else div.text
    if results == items:
        print('found')
    else:
        print('not found')
        results = {}
Output:
found
not found
Given this HTML snippet
<ul>
<li class="List">
<div class="List-Content row">
<div class="Price">"$20.00"</div>
<div class="Quantity">10</div>
<div class="Change">0%</div>
<div class="Total"><span>$400.00</span></div>
<div class="Buy"><a class="Button">Buy</a></div>
</div>
</li>
</ul>
I would use this XPath:
buy_buttons = driver.find_elements(By.XPATH, """//div[
    contains(@class, 'List-Content')
    and div[@class = 'Price'] = '$20.00'
    and div[@class = 'Total'] = '$400.00'
]//a[. = 'Buy']""")
for buy_button in buy_buttons:
    print(buy_button)
The for loop replaces your if len(buy_buttons) > 0 check. It won't run when there are no results, so the if is superfluous.
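If the row renders asynchronously, a rough sketch of waiting for the same locator before clicking it (this relies on Selenium's expected_conditions helpers, which the answer above does not use, and assumes `driver` is an initialised WebDriver):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

xpath = ("//div[contains(@class, 'List-Content')"
         " and div[@class = 'Price'] = '$20.00'"
         " and div[@class = 'Total'] = '$400.00'"
         "]//a[. = 'Buy']")
# Wait up to 10 seconds for a matching Buy link to become clickable, then click it.
buy_button = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, xpath)))
buy_button.click()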

Stale elements not getting resolved after waits, python

I am trying to go into every link on this page with the class "course":
<a name="hrvatski-jezik" href="/pregled/predmet/29812177240/1971997880"><div class="course">
Hrvatski jezik <br>
<span class="course-info">Tamara Čer</span>
</div>
</a>
<a name="likovna-kultura" href="/pregled/predmet/29812176230/1971998890">
<div class="course">Likovna kultura <br>
<span class="course-info">Mia Marušić</span>
</div>
</a>
<a name="glazbena-kultura" href="/pregled/predmet/29812175220/1971999900">
<div class="course">Glazbena kultura <br>
<span class="course-info">Danijel Služek</span>
</div>
</a>
<a name="engleski-jezik" href="/pregled/predmet/29820696590/1972511970">
<div class="course">Engleski jezik <br>
<span class="course-info">Nevena Genčić</span>
</div>
</a>
<a name="matematika" href="/pregled/predmet/29812174210/1972000910">
<div class="course">Matematika <br>
<span class="course-info">Ivan Tomljanović</span>
</div></a>
<a name="biologija" href="/pregled/predmet/29812173200/1972001920">
<div class="course">Biologija <br>
<span class="course-info">Antonija Milić</span>
</div>
</a>
<a name="kemija" href="/pregled/predmet/29812172190/1972002930">
<div class="course">Kemija <br>
<span class="course-info">Antonija Milić</span>
</div>
</a>
<a name="fizika" href="/pregled/predmet/29812171180/1972003940">
<div class="course">Fizika <br>
<span class="course-info">Ivan Kunac</span>
</div>
</a>
<a name="povijest" href="/pregled/predmet/29812170170/1972004950">
<div class="course">Povijest <br>
<span class="course-info">Lovorka Krajnović Tot</span>
</div>
</a>
<a name="geografija" href="/pregled/predmet/29812169160/1972005960">
<div class="course">Geografija <br>
<span class="course-info">Sunčica Podolski <strong> (na zamjeni)</strong>, Oliver Timarac</span>
</div>
</a>
<a name="tehnicka-kultura" href="/pregled/predmet/29812168150/1972006970">
<div class="course">Tehnička kultura <br>
<span class="course-info">Ivan Dorotek</span>
</div>
</a>
<a name="tjelesna-i-zdravstvena-kultura" href="/pregled/predmet/29812167140/1972007980">
<div class="course">Tjelesna i zdravstvena kultura <br>
<span class="course-info">Davor Marković, Tomislav Ruskaj</span>
</div>
</a>
<a name="informatika" href="/pregled/predmet/29821462170/1972568530">
<div class="course">Informatika (izborni) <br>
<span class="course-info">Blaženka Knežević</span>
</div>
</a>
<a name="njemacki-jezik" href="/pregled/predmet/32658461270/1972646300"><div class="course">Njemački jezik (izborni) <br>
<span class="course-info">Zdravka Marković Boto</span>
</div>
</a>
<a name="rusinski-jezik-i-kultura" href="/pregled/predmet/32658491570/1972675590">
<div class="course">Rusinski jezik i kultura (izborni) <br>
<span class="course-info">Natalija Hnatko, Ilona Hrecešin</span>
</div>
</a>
<a name="sat-razrednika" href="/pregled/predmet/32322897860/2140793120">
<div class="course">Sat razrednika <br>
<span class="course-info">Blaženka Knežević</span>
</div>
</a>
<a name="izvannastavne-aktivnosti" href="/pregled/predmet/34285616720/2324344460">
<div class="course">Izvannastavne aktivnosti (izvannastavna aktivnost) <br>
<span class="course-info">Nevena Genčić, Ivan Kunac, Davor Marković, Josip Matezović, Antonija Milić, Tomislav Ruskaj, Danijel Služek</span>
</div>
</a>
I expect the code to go into every link, then go back and repeat.
It goes once into the try block and then 16 times into the except block.
For every except it gives StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
My code:
def get_subject():
    subjects = driver.find_elements_by_xpath("//div[@class='course']")
    for subject in subjects:
        actions = ActionChains(driver)
        actions.move_to_element(subject)
        try:
            actions.click()
            actions.perform()
            driver.back()
            print("try")
            time.sleep(3)
        except Exception as e:
            subjects = driver.find_elements_by_xpath("//div[@class='course']")
            print("except")
            print(e)
I know this is a very common problem. I tried implicit and explicit waits and still got the same error.
I tried visibility_of_element_located, presence_of_element_located, and staleness_of, and I tried defining "subjects" again.
Help would be really appreciated; I've been searching for a solution for some time now.
I would suggest capturing all the links first and then iterating over them.
alllinks = [link.get_attribute('href') for link in driver.find_elements_by_css_selector("a[href^='/pregled/predmet']")]
for link in alllinks:
    driver.get(link)
    # Perform your operation
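If you also need the course name alongside each link, a small extension of the same idea (assuming the markup is exactly the snippet above; the print is just a placeholder for the real per-subject work):
pairs = []
for a in driver.find_elements_by_css_selector("a[href^='/pregled/predmet']"):
    # .text of the course block also includes the teacher line, e.g. "Hrvatski jezik\nTamara Čer"
    name = a.find_element_by_css_selector("div.course").text
    pairs.append((name, a.get_attribute('href')))
for name, link in pairs:
    driver.get(link)
    print("visiting:", name)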
If you want to continue with your existing code, just re-assign the elements again inside the loop: since you are using driver.back(), the page gets refreshed.
def get_subject():
    subjects = driver.find_elements_by_xpath("//div[@class='course']")
    for subject in range(len(subjects)):
        subjects = driver.find_elements_by_xpath("//div[@class='course']")
        actions = ActionChains(driver)
        actions.move_to_element(subjects[subject])
        try:
            actions.click()
            actions.perform()
            driver.back()
            print("try")
            time.sleep(3)
        except Exception as e:
            subjects = driver.find_elements_by_xpath("//div[@class='course']")
            print("except")
            print(e)
actions.move_to_element(subject) is the culprit here.
All the element references in subjects are invalidated when you click on a subject in your try block; that's why you are getting the StaleElementReferenceException.
Try changing the code as shown below.
def get_subject():
    subjects = len(driver.find_elements_by_xpath("//div[@class='course']"))
    for counter in range(subjects):
        # XPath indexing is 1-based, so re-locate the (counter + 1)-th course on each pass.
        subject = driver.find_element_by_xpath("(//div[@class='course'])[" + str(counter + 1) + "]")
        actions = ActionChains(driver)
        actions.move_to_element(subject)
        try:
            actions.click()
            actions.perform()
            driver.back()
            print("try")
            time.sleep(3)
        except Exception as e:
            print("except")
            print(e)
The driver.back() call isn't guaranteed to work. Instead, try driver.execute_script("window.history.go(-1)") and then re-assign the elements again inside your loop.
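Putting both suggestions together, a rough sketch (assuming `driver` and `time` are available; it clicks the element directly instead of going through ActionChains, and the sleeps stand in for proper waits):
count = len(driver.find_elements_by_xpath("//div[@class='course']"))
for i in range(count):
    # Re-locate the i-th course after every navigation so the reference is never stale.
    subject = driver.find_elements_by_xpath("//div[@class='course']")[i]
    subject.click()
    time.sleep(3)
    driver.execute_script("window.history.go(-1)")
    time.sleep(3)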

Iterate div tags using lxml and retrieve text for a dictionary in python

I just came across lxml in Python and I need some help, as I have no experience with XPath.
I want to get text data from a webpage into a dictionary.
I'm referring to the html snippet I posted below. Within the original html page there's a div element of the class general-info that I retrieve using the following line:
general_info = document_tree.xpath("//div[contains(concat(' ', normalize-space(#class), ' '), 'general-info')]")
From here on I want to iterate over the nested divs and get the two <p> tags as key and value, with the text inside the <strong> being the key.
There can also be empty div tags and there can be a special case where the key and the value for the dictionary can be within the same div (see the last element).
EDIT:
The number of elements can change, so it would be best to use the <strong> tags as starting point and then search for the next <p> tag.
This is code that I was able to write using BeautifulSoup:
generalinfo = documentSoup.findAll("div", {"class": "general-info"})
if generalinfo:
    strongs = generalinfo[0].find_all('strong')
    for descr in strongs:
        p = descr.find_next_sibling("p")
        if p:
            key = descr.text.strip().rstrip(':')
            details_dict[key] = p.text.strip()
        else:
            nextdiv = descr.parent.parent.find_next_sibling("div")
            if nextdiv:
                child = nextdiv.findChild()
                if child:
                    key = descr.text.strip()[:-1]
                    details_dict[key] = child.text.strip()
I am going for the following output:
{'Title:': 'This is a title',
 'Owner:': 'This is an owner',
 'Category:': 'This is a category',
 'Type:': 'This is a type',
 'Special case:': 'This is a special case'}
If anyone can help me out here, I'd appreciate it!
HTML code:
<body>
<main>
<div>
...
<div class="general-info margin-bottom-20 margin-top-20">
<div class="row padding-x-20">
<div class="col-sm-4">
<p class="margin-0">
<strong>Title:</strong>
</p>
</div>
<div class="col-sm-8">
<p class="margin-0">This is a title</p>
</div>
</div>
<div class="row padding-x-20">
<div class="col-sm-4">
<p class="margin-0">
<strong>Owner:</strong>
</p>
</div>
<div class="col-sm-8">
<p class="margin-0">This is an owner</p>
</div>
</div>
<h2 class="h3 margin-top-10 margin-bottom-10 padding-x-20">Validity</h2>
<div class="row padding-x-20">
<div class="col-sm-4">
<p class="margin-0">
<strong>Category:</strong>
</p>
</div>
<div class="col-sm-8">
<p class="margin-0">This is a category</p>
</div>
</div>
<div class="row padding-x-40"></div>
<div class="row padding-x-20">
<div class="col-sm-4">
<p class="margin-0">
<strong>Type:</strong>
</p>
</div>
<div class="col-sm-8">
<p class="margin-0">This is a type</p>
</div>
</div>
<div class="row padding-x-40">
<div>
<strong>Special case:</strong>
<p>This is a special case</p>
</div>
</div>
</div>
...
I believe this is about as generalized as I can get given the html provided:
general_info = doc.xpath("//div[contains(concat(' ', normalize-space(@class), ' '), 'general-info')]//p[@class='margin-0']")
entry = ''
for i in general_info:
    if len(i.xpath('./strong/text()')) > 0:
        topic = i.xpath('./strong/text()')[0]
    if len(i.text.strip()) > 0:
        entry += i.text.replace('\n', '').strip()
        print(topic + ' ' + i.text.replace('\n', '').strip())
special = general_info[0].xpath('./ancestor::div[@class="general-info margin-bottom-20 margin-top-20"]//div/div/strong')[0]
print(special.text + " ", special.xpath('./following-sibling::p/text()')[0])
Output:
('Title: This is a title',
'Owner: This is an owner',
'Category: This is a category',
'Type: This is a type',
'Special case: This is a special case')
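If you want the result as the dictionary from the question rather than printed lines, a minimal lxml sketch under the same assumptions (here `page_source` is assumed to hold the HTML shown above) can key everything off the <strong> tags:
from lxml import html as lxml_html

doc = lxml_html.fromstring(page_source)
details = {}
for strong in doc.xpath("//div[contains(@class, 'general-info')]//strong"):
    key = strong.text_content().strip()
    # The value is the <p> in the same row that does not wrap a <strong>;
    # this also covers the special case where key and value share one div.
    value = strong.xpath("ancestor::div[contains(@class, 'row')][1]//p[not(strong)]")
    if value:
        details[key] = value[0].text_content().strip()
print(details)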
I recommend another solution, which is very suitable for extracting data from XML.
from simplified_scrapy.spider import SimplifiedDoc
html='''
<body>
<main>
<div class="general-info margin-bottom-20 margin-top-20">
<div class="row padding-x-20">
<div class="col-sm-4">
<p class="margin-0">
<strong>Title:</strong>
</p>
</div>
<div class="col-sm-8">
<p class="margin-0">This is a title</p>
</div>
</div>
<div class="row padding-x-20">
<div class="col-sm-4">
<p class="margin-0">
<strong>Owner:</strong>
</p>
</div>
<div class="col-sm-8">
<p class="margin-0">This is an owner</p>
</div>
</div>
<h2 class="h3 margin-top-10 margin-bottom-10 padding-x-20">Validity</h2>
<div class="row padding-x-20">
<div class="col-sm-4">
<p class="margin-0">
<strong>Category:</strong>
</p>
</div>
<div class="col-sm-8">
<p class="margin-0">This is a category</p>
</div>
</div>
<div class="row padding-x-40"></div>
<div class="row padding-x-20">
<div class="col-sm-4">
<p class="margin-0">
<strong>Type:</strong>
</p>
</div>
<div class="col-sm-8">
<p class="margin-0">This is a type</p>
</div>
</div>
<div class="row padding-x-40">
<div>
<strong>Special case:</strong>
<p>This is a special case</p>
</div>
</div>
</div>
'''
data = {}
doc = SimplifiedDoc(html)  # create doc
divs = doc.selects('div.general-info')

# First way
for div in divs:
    strongs = div.strongs
    for strong in strongs:
        p = strong.next
        if not p:
            p = strong.parent.next
        data[strong.text] = p.text
print(data)

data = {}
# Second way
for div in divs:
    ds = div.selects('strong|p>text()')
    for i in range(0, len(ds), 2):
        data[ds[i]] = ds[i + 1]
print(data)
Result:
{'Title:': 'This is a title', 'Owner:': 'This is an owner', 'Category:': 'This is a category', 'Type:': 'This is a type', 'Special case:': 'This is a special case'}
{'Title:': 'This is a title', 'Owner:': 'This is an owner', 'Category:': 'This is a category', 'Type:': 'This is a type', 'Special case:': 'This is a special case'}
Here are more examples: https://github.com/yiyedata/simplified-scrapy-demo/blob/master/doc_examples/

How to get index number of specific tag and class by searching some text?

I have the following HTML:
<ul class="vote_list clearfix" id="vote_div">
<li class="vote_one">
<div class="vote_show">
<div class="vote_T1">Chelsea</div>
<div class="vote_state">
<div class="vote_ST1">Votes:30000</div>
<div class="vote_ST2">Ranking:1</div>
</div>
</div>
<div class="vote_date">
<div class="vote_T1">Chelsea</div>
</div>
</li>
<li class="vote_one">
<div class="vote_show">
<div class="vote_T1">Arsenal</div>
<div class="vote_state">
<div class="vote_ST1">Votes:20000</div>
<div class="vote_ST2">Ranking:2</div>
</div>
</div>
<div class="vote_date">
<div class="vote_T1">Arsenal</div>
</div>
</li>
<li class="vote_one">
<div class="vote_show">
<div class="vote_T1">Liverpool</div>
<div class="vote_state">
<div class="vote_ST1">Votes:10000</div>
<div class="vote_ST2">Ranking:3</div>
</div>
</div>
<div class="vote_date">
<div class="vote_T1">Liverpool</div>
</div>
</li>
</ul>
I want to extract the total votes for Chelsea, so it should show Votes:30000.
My idea is to find which <li class="vote_one"> contains the Chelsea text; it should return 0, since Chelsea is located in the first vote_one element.
But I don't know how to turn this idea into code.
Thanks in advance.
Finally solved, thanks to @Idlehands:
soup = BeautifulSoup(full_content, "lxml")
i = 0
for vote_one_list in soup.find_all("li", class_="vote_one"):
    if vote_one_list.find("div", class_="vote_show").find("div", class_="vote_T1").text == "Chelsea":
        total_vote = soup.find_all("li", class_="vote_one")[i].find("div", class_="vote_show").find("div", class_="vote_state").find("div", class_="vote_ST1").text
        rank = soup.find_all("li", class_="vote_one")[i].find("div", class_="vote_show").find("div", class_="vote_state").find("div", class_="vote_ST2").text
        print("Chelsea | " + rank + " | " + total_vote)
    i = i + 1
Printing the votes and rank
The simplest way to get the votes for any given input would be:
input_str = 'Chelsea'
for vote in soup.find_all('div', class_='vote_show'):
    if vote.find('div', class_='vote_T1').get_text().strip() == input_str:
        print(vote.find('div', class_='vote_ST1').get_text().strip())  # Prints votes
        print(vote.find('div', class_='vote_ST2').get_text().strip())  # Prints rank
The solution looks at every <div class='vote_show'> and checks whether the text in its <div class='vote_T1'> is the same as the input string (Chelsea, for example).
I added the strip() so that you can find a match even if there are spaces around the string. If a match is found, the text of the contained <div class='vote_ST1'> is printed, again stripping any surrounding whitespace.
Printing the index
You can modify the for loop to use enumerate() as follows:
for idx, vote in enumerate(soup.find_all('div', class_='vote_show')):
    if vote.find('div', class_='vote_T1').get_text().strip() == input_str:
        print(idx)  # prints index
        print(vote.find('div', class_='vote_ST1').get_text().strip())  # prints votes
        print(vote.find('div', class_='vote_ST2').get_text().strip())  # prints rank
Enumerate allows us to loop over something and have an automatic counter.
If you want to stop looking any further once you've found a match, you can add a break statement after the print() statement.
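Concretely, the early-exit version would look something like this (same soup and input_str as above):
for idx, vote in enumerate(soup.find_all('div', class_='vote_show')):
    if vote.find('div', class_='vote_T1').get_text().strip() == input_str:
        print(idx, vote.find('div', class_='vote_ST1').get_text().strip())
        break  # stop looking once the first match is found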

Extracting the right elements by text and span / Beautiful Soup / Python

I'm trying to scrape the following data:
Cuisine: 4.5
Service: 4.0
Quality: 4.5
But I'm having issues scraping the right data. I tried the following two pieces of code:
for bewertungen in soup.find_all('div', {'class': 'histogramCommon bubbleHistogram wrap'}):
    if bewertungen.find(text='Cuisine'):
        cuisine = bewertungen.find(text='Cuisine')
        cuisine = cuisine.next_element
        print("test " + str(cuisine))
    if bewertungen.find_all(text='Service'):
        for s_bewertung in bewertungen.find_all('span', {'class': 'ui_bubble_rating'}):
            s_speicher = s_bewertung['alt']
In the first if I get no result. In the second if I get the right elements, but I get all three results and I can't tell which one belongs to which text (Cuisine, Service, Quality).
Can someone give me advice on how to get the right data?
I've put the HTML code at the bottom.
<div class="histogramCommon bubbleHistogram wrap">
<div class="colTitle">\nGesamtwertung\n</div>
<ul class="barChart">
<li>
<div class="ratingRow wrap">
<div class="label part ">
<span class="text">Cuisine</span>
</div>
<div class="wrap row part ">
<span alt="4.5 of five" class="ui_bubble_rating bubble_45"></span>
</div>
</div>
<div class="ratingRow wrap">
<div class="label part ">
<span class="text">Service</span>
</div>
<div class="wrap row part ">
<span alt="4.0 of five" class="ui_bubble_rating bubble_40"></span>
</div>
</div>
</li>
<li>
<div class="ratingRow wrap">
<div class="label part ">
<span class="text">Quality</span>
</div>
<div class="wrap row part "><span alt="4.5 of five" class="ui_bubble_rating bubble_45"></span></div>
</div>
</li>
</ul>
</div>
Try this. According to the snippet you have pasted above, the following code should work:
from bs4 import BeautifulSoup

soup = BeautifulSoup(content, "lxml")
for item in soup.select(".ratingRow"):
    category = item.select_one(".text").text
    rating = item.select_one(".row span")['alt'].split(" ")[0]
    print("{} : {}".format(category, rating))
Another way would be:
for item in soup.select(".ratingRow"):
    category = item.select_one(".text").text
    rating = item.select_one(".text").find_parent().find_next_sibling().select_one("span")['alt'].split(" ")[0]
    print("{} : {}".format(category, rating))
Output:
Cuisine : 4.5
Service : 4.0
Quality : 4.5
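If you'd rather keep the numbers around instead of printing them, a small variant (same soup as above) collects them into a dict of floats:
ratings = {}
for item in soup.select(".ratingRow"):
    category = item.select_one(".text").text
    # alt looks like "4.5 of five", so the first token is the numeric rating.
    ratings[category] = float(item.select_one(".row span")['alt'].split(" ")[0])
print(ratings)  # e.g. {'Cuisine': 4.5, 'Service': 4.0, 'Quality': 4.5}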
