How can I get only the name and contact number from a div? - Python

I'm trying to get the name and contact number from a div that contains spans, but the problem is that sometimes the div has only one span, sometimes two, and sometimes three.
The first span has the name.
The second span has other data.
The third span has the contact number.
Here is HTML
<div class="ds-body-small" id="yui_3_18_1_1_1554645615890_3864">
<span class="listing-field" id="yui_3_18_1_1_1554645615890_3863">beth
budinich</span>
<span class="listing-field"><a href="http://Www.redfin.com"
target="_blank">See listing website</a></span>
<span class="listing-field" id="yui_3_18_1_1_1554645615890_4443">(206)
793-8336</span>
</div>
Here is my Code
try:
    name = browser.find_element_by_xpath("//span[@class='listing-field'][1]")
    name = name.text.strip()
    print("name : " + name)
except:
    print("Name are missing")
    name = "N/A"
try:
    contact_info = browser.find_element_by_xpath("//span[@class='listing-field'][3]")
    contact_info = contact_info.text.strip()
    print("contact info : " + contact_info)
except:
    print("contact_info are missing")
    days = "N/A"
My code is not giving me the correct result. Can anyone provide the best possible solution? Thanks.

You can iterate over the contacts and check whether each one has a child a element or matches a phone number pattern:
import re

contacts = browser.find_elements_by_css_selector("span.listing-field")
contact_name = []
contact_phone = "N/A"
contact_web = "N/A"
for i in range(0, len(contacts)):
    if len(contacts[i].find_elements_by_tag_name("a")) > 0:
        contact_web = contacts[i].find_element_by_tag_name("a").get_attribute("href")
    elif re.search("\\(\\d+\\)\\s+\\d+-\\d+", contacts[i].text):
        contact_phone = contacts[i].text
    else:
        contact_name.append(contacts[i].text)
contact_name = ", ".join(contact_name) if len(contact_name) > 0 else "N/A"
Output:
contact_name: ['Kevin Howard', 'Howard enterprise']
contact_phone: '(206) 334-8414'
The page has a captcha. For scraping it is better to use requests; all the information is provided in JSON format.
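As a hedged sketch of that requests approach (the URL, the script variable name, and the JSON key paths below are all assumptions for illustration, not taken from the actual site):
import json
import re
import requests

# Hypothetical listing URL -- substitute the real page being scraped.
url = "https://www.example.com/listing/12345"
# Some sites block the default requests User-Agent.
headers = {"User-Agent": "Mozilla/5.0"}
response = requests.get(url, headers=headers)
response.raise_for_status()
# Many listing pages embed their data as a JSON blob inside a <script> tag.
# The variable name "__initialState" is an assumption -- inspect the page
# source to find the real one.
match = re.search(r"__initialState\s*=\s*(\{.*?\});", response.text, re.S)
if match:
    data = json.loads(match.group(1))
    # These key paths are illustrative only.
    agent = data.get("agent", {})
    print(agent.get("name"))
    print(agent.get("phone"))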

# sudharsan
# April 07 2019
from bs4 import BeautifulSoup

text = '''<div class="ds-body-small" id="yui_3_18_1_1_1554645615890_3864">
<span class="listing-field" id="yui_3_18_1_1_1554645615890_3863">beth
budinich</span>
<span class="listing-field"><a href="http://Www.redfin.com"
target="_blank">See listing website</a></span>
<span class="listing-field" id="yui_3_18_1_1_1554645615890_4443">(206)
793-8336</span>
</div>'''
# The given sample HTML is stored as input in a variable called "text"
soup = BeautifulSoup(text, "html.parser")
main = soup.find_all(class_="listing-field")
# Now the spans with class name "listing-field" are stored as a list in "main"
print(main[0].text)
# it will print the first span element
print(main[-1].text)
# it will print the last span element
# Thank you
# if you like the code "Vote for it"
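Note that find_all (rather than find) is what makes main a list of spans here; main[0] and main[-1] then select the first and last span, which per the question hold the name and the contact number.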

Related

Clicking a button within searched table using Python Selenium

I am trying to search a table for a specific value (Document ID) and then press a button that is next to that column (Retire). I should add here that the 'Retire' button is only visible once the mouse hovers over it, but I have built that into my code, which I'll share further down.
So for example:
My Document ID would be 0900766b8001b6a3, and I would want to click the button called 'Retire'. The issue I'm having is pulling the XPaths for the retire buttons; this needs to be dynamic. I got it working for some Document IDs that had a common link, for example:
A700000007201082 has XPath = //*[@id="retire-7201082"] (you can see the commonality here, with the Document ID ending in the same digits as the XPath number 7201082). Whereas in the first example, the XPath for '0900766b8001b6a3' is //*[@id="retire-251642"]; you can see the retire number here is completely unrelated to the Document ID and therefore hard to build into an XPath manually.
Here is my code:
before_XPath = "//*[@class='wp-list-table widefat fixed striped table-view-list pages']/tbody/tr["
aftertd_XPath_1 = "]/td[1]"
aftertd_XPath_2 = "]/td[2]"
aftertd_XPath_3 = "]/td[3]"
before_XPath_1 = "//*[@class='wp-list-table widefat fixed striped table-view-list pages']/tbody/tr[1]/th["
before_XPath_2 = "//*[@class='wp-list-table widefat fixed striped table-view-list pages']/tbody/tr[2]/td["
aftertd_XPath = "]/td["
after_XPath = "]"
aftertr_XPath = "]"
search_text = "0900766b8001af05"

time.sleep(10)
num_rows = len(driver.find_elements_by_xpath("//*[@class='wp-list-table widefat fixed striped table-view-list pages']/tbody/tr"))
num_columns = len(driver.find_elements_by_xpath("//*[@class='wp-list-table widefat fixed striped table-view-list pages']/tbody/tr[2]/td"))
elem_found = False
for t_row in range(2, (num_rows + 1)):
    for t_column in range(1, (num_columns + 1)):
        FinalXPath = before_XPath + str(t_row) + aftertd_XPath + str(t_column) + aftertr_XPath
        cell_text = driver.find_element_by_xpath(FinalXPath).text
        if cell_text.casefold() == search_text.casefold():
            print("Search Text " + search_text + " is present at row " + str(t_row) + " and column " + str(t_column))
            elem_found = True
            achains = ActionChains(driver)
            move_to = driver.find_element_by_xpath("/html/body/div[1]/div[2]/div[2]/div[1]/div[3]/form[1]/table/tbody/tr[" + str(t_row) + "]/td[2]")
            achains.move_to_element(move_to).perform()
            retire_xpath = driver.find_element_by_xpath("//*[@id='retire-" + str(search_text[-7:]) + "']")
            time.sleep(6)
            driver.execute_script("arguments[0].click();", move_to)
            time.sleep(6)
            driver.switch_to.alert.accept()
            break
if (elem_found == False):
    print("Search Text " + search_text + " not found")
This particular bit of code lets me handle any Document IDs such as 'A700000007201082' as I can just cut off the part I need and build it into an XPath:
retire_xpath = driver.find_element_by_xpath("//*[@id='retire-" + str(search_text[-7:]) + "']")
I've tried to replicate the above for the Doc IDs starting with 09007, but I can't find how to pull that unique number as it isn't anywhere accessible in the table.
I am wondering if there's something I can do to build it the same way I have above or perhaps focus on the index? Any advice is much appreciated, thanks.
EDIT:
This is the HTML code for the RETIRE button for Document ID: 0900766b8001b6a3
<span class="retire"><button id="retire-251642" data-document-id="251642" rel="bookmark" aria-label="Retire this document" class="rs-retire-link">Retire</button></span>
You can see the retire button id is completely different to the Document ID. Here is some HTML code just above it which I think could be useful:
<div class="hidden" id="inline_251642">
<div class="post_title">General Purpose Test Kit Lead</div><div class="post_name">0900766b8001b6a3</div>
<div class="post_author">4</div>
<div class="comment_status">closed</div>
<div class="ping_status">closed</div>
<div class="_status">publish</div>
<div class="jj">30</div>
<div class="mm">03</div>
<div class="aa">2001</div>
<div class="hh">15</div>
<div class="mn">43</div>
<div class="ss">03</div>
<div class="post_password"></div><div class="post_parent">0</div><div class="page_template">default</div><div class="tags_input" id="rs-language-code_251642">de, en, fr, it</div><div class="tags_input" id="rs-current-state_251642">public</div><div class="tags_input" id="rs-doc-class-code_251642">rs_instruction_sheet</div><div class="tags_input" id="rs-restricted-countries_251642"></div></div>
Would it be possible to call the div class "post_name", as this has the correct Doc ID, and then press the RETIRE button for that specific Doc ID?
Thank you.
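A sketch of one way to do exactly that, based only on the HTML shown above (it assumes every document row has a hidden inline_* div laid out like the one quoted; not tested against the real page):
search_text = "0900766b8001b6a3"
# Find the hidden div whose post_name child holds the Document ID.
post_name_div = driver.find_element_by_xpath(
    "//div[@class='post_name'][text()='" + search_text + "']")
# Its parent is <div class="hidden" id="inline_251642">; the digits after
# "inline_" are the internal id that the retire button reuses.
hidden_div = post_name_div.find_element_by_xpath("./..")
internal_id = hidden_div.get_attribute("id").replace("inline_", "")
# Build the retire button id from the internal id and click it.
retire_button = driver.find_element_by_id("retire-" + internal_id)
driver.execute_script("arguments[0].click();", retire_button)
driver.switch_to.alert.accept()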

XPath: How to check if a child element exists given the parent node?

I want to loop over the parent nodes, check whether each parent has a certain child, and extract data from it.
The website's markup is something like this:
<div class="reviews">
  <div id="user1">
    <div class="name"> Will </div>
    <div class="weight"> 50kg </div>
    <div class="height"> 160cm </div>
  </div>
  <div id="user2">
    <div class="weight"> 55kg </div>
    <div class="height"> 170cm </div>
  </div>
  <div id="user3">
    <div class="name"> Ben </div>
    <div class="height"> 180cm </div>
  </div>
</div>
My code so far looks something like this:
import csv
import os
import pandas as pd
from selenium import webdriver
chromedriver = "path to chromedriver"
driver = webdriver.Chrome(chromedriver)
driver.get( 'url of a website')
name_row = []
weight_row = []
height_row = []
for i in range(len(driver.find_elements_by_xpath('//div[@class="reviews"]/div'))):
    # Get the first parent (user1)
    driver.find_element_by_xpath('(//div[@class="reviews"]/div)' + '[' + str(i + 1) + ']')
    # Check if it has elements like name, weight, and height and add them to the appropriate list.
    # For example, name_row.append(driver.find_element_by_xpath("xpath to name if it exists"))
    # If missing any element return "None"
    # Then move on to the second parent (user2) and so on
df = pd.DataFrame({'Names': name_row, 'Weight': weight_row, 'Height': height_row})
I want my end result to look like:

Name    Weight    Height
Will    50kg      160cm
None    55kg      170cm
Ben     None      180cm
I've looked at other posts too but just can't seem to find the answer I'm looking for.
I tried just doing find_elements_by_xpath and putting the name, weight, and height values in their respective lists, but this does not insert a "None" value in any of the lists, and I end up with errors about the arrays not being the same length and whatnot.
I would suggest something like this:
import time

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

chromedriver = "path to chromedriver"
driver = webdriver.Chrome(chromedriver)
driver.get('url of a website')
wait = WebDriverWait(driver, 10)

names = []
weights = []
heights = []

# wait for the parent element to be visible
wait.until(EC.visibility_of_element_located((By.XPATH, "//div[@class='reviews']")))
# let all the elements load
time.sleep(0.5)
# get the list of reviews (each child div of the reviews container)
reviews = driver.find_elements_by_xpath("//div[@class='reviews']/div")
# iterate over reviews and get each inner element per review
for i in range(len(reviews)):
    # get the name element for the current review; this returns a list of web elements
    name = reviews[i].find_elements_by_xpath(".//div[@class='name']")
    # in case the element exists the list is non-empty, so it is interpreted as Boolean True
    if name:
        # extract the actual element value and append it to the list of names
        names.append(name[0].text)
    # otherwise append "None"
    else:
        names.append("None")
    # the same for the other parameters
    weight = reviews[i].find_elements_by_xpath(".//div[@class='weight']")
    if weight:
        weights.append(weight[0].text)
    else:
        weights.append("None")
    height = reviews[i].find_elements_by_xpath(".//div[@class='height']")
    if height:
        heights.append(height[0].text)
    else:
        heights.append("None")
Once you have the parent elements, you can query again for the child elements inside each parent, like:
parents = driver.find_element_by_xpath('//div[@class="reviews"]').find_elements_by_tag_name("div")
Then loop over the parent elements and check inside each one whether the other elements exist or not:
for parent in parents:
    parent.find_element_by_class_name('name')
    # ...weight
    # ...height
Here's the documentation:
Use this when you want to locate an element by class name. With this strategy, the first element with the matching class name attribute will be returned. If no element has a matching class name attribute, a NoSuchElementException will be raised.
So now you can do a try/except while checking for name, weight, and so on; if found, get the data, if not, write None.
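A minimal sketch of that try/except approach, assuming parents was collected as above (the list names follow the question's code):
from selenium.common.exceptions import NoSuchElementException

name_row = []
weight_row = []
height_row = []
for parent in parents:
    # For each field, try to locate the child; on failure record "None".
    try:
        name_row.append(parent.find_element_by_class_name('name').text)
    except NoSuchElementException:
        name_row.append("None")
    try:
        weight_row.append(parent.find_element_by_class_name('weight').text)
    except NoSuchElementException:
        weight_row.append("None")
    try:
        height_row.append(parent.find_element_by_class_name('height').text)
    except NoSuchElementException:
        height_row.append("None")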

BeautifulSoup find - exclude nested tag from block of interest

I have a scraper that looks for pricing on particular product pages. I'm only interested in the current price - whether the product is on sale or not.
I store the identifying tags like this in a JSON file:
{
    "some_ecommerce_site" : {
        "product_name" : ["span", "data-test", "product-name"],
        "breadcrumb" : ["div", "class", "breadcrumbs"],
        "sale_price" : ["span", "data-test", "sale-price"],
        "regular_price" : ["span", "data-test", "product-price"]
    }
}
And have these functions to select current price and clean up the price text:
def get_pricing(self, rpi, spi):
    sale_price = self.soup_object.find(spi[0], {spi[1] : spi[2]})
    regular_price = self.soup_object.find(rpi[0], {rpi[1] : rpi[2]})
    return sale_price if sale_price else regular_price

def get_text(self, obj):
    return re.sub(r'\s\s+', '', obj.text.strip()).encode('utf-8')
Which are called by:
def get_ids(self, name_of_ecommerce_site):
    with open('site_identifiers.json') as j:
        return json.load(j)[name_of_ecommerce_site]

def get_data(self):
    rpi = self.site_ids['regular_price']
    spi = self.site_ids['sale_price']
    product_price = self.get_text( self.get_pricing(rpi, spi) )
This works for all but one site so far because their pricing is formatted like so:
<div class="product-price">
<h3>
£15.00
<span class="price-standard">
£35.00
</span>
</h3>
</div>
So what product_price returns is "£15£35" instead of the desired "£15".
Is there a simple way to exclude the nested <span> which won't break for the working sites?
I thought a solution would be to get a list and select index 0, but checking the tag's contents, that won't work as it's a single item in the list:
>> print(type(regular_price))
>> <class 'bs4.element.Tag'>
>> print(regular_price.contents)
>> [u'\n', <h3>\n\n\xa325.00\n\n<span class="price-standard">\n\n\xa341.00\n</span>\n</h3>, u'\n']
I've tried creating a list out of the result's NavigableString elements then filtering out the empty strings:
filter(None, [self.get_text(unicode(x)) for x in sale_price.find_all(text=True)])
This fixes that one case, but breaks a few of the others (since they often have the currency in a different tag than the value amount) - I get back "£".
If you want to get the text without the child element's text, you can do it like this:
from bs4 import BeautifulSoup, NavigableString

html = """
<div class="product-price">
<h3>
£15.00
<span class="price-standard">
£35.00
</span>
</h3>
</div>
"""
bs = BeautifulSoup(html, "xml")
result = bs.find("div", {"class": "product-price"})
# keep only the direct text children of <h3>, skipping child tags like <span>
fr = [element for element in result.h3 if isinstance(element, NavigableString)]
print(fr[0])
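For the sample above, fr[0] is the h3's own text node, so this should print the £15.00 line (with its surrounding whitespace; fr[0].strip() would trim it to just the price).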
The question may be a duplicate of this one.

Extracting data from an inconsistent HTML page using BeautifulSoup4 and Python

I'm trying to extract data from this webpage and I'm having some trouble due to inconsistencies within the page's HTML formatting. I have a list of OGAP IDs and I want to extract the Gene Name and any literature information (PMID #) for each OGAP ID I iterate through. Thanks to other questions on here, and the BeautifulSoup documentation, I've been able to consistently get the gene name for each ID, but I'm having trouble with the literature part. Here are a couple of search terms that highlight the inconsistencies.
HTML sample that works
Search term: OG00131
<tr>
<td colspan="4" bgcolor="#FBFFCC" class="STYLE28">Literature describing O-GlcNAcylation:
<br> PMID:
20068230
[CAD, ETD MS/MS]; <br>
<br>
</td>
</tr>
HTML sample that doesn't work
Search term: OG00020
<td align="top" bgcolor="#FBFFCC">
<div class="STYLE28">Literature describing O-GlcNAcylation: </div>
<div class="STYLE28">
<div class="STYLE28">PMID:
16408927
[Azide-tag, nano-HPLC/tandem MS]
</div>
<br>
Site has not yet been determined. Use
OGlcNAcScan
to predict the O-GlcNAc site. </div>
</td>
Here's the code I have so far
import urllib2
from bs4 import BeautifulSoup

#define list of genes
#initialize variables
gene_list = []
literature = []

# Test list
gene_listID = ["OG00894", "OG00980", "OG00769", "OG00834", "OG00852", "OG00131", "OG00020"]

for i in range(len(gene_listID)):
    print gene_listID[i]
    # Specifies URL, uses the "%" to sub in different ogapIDs based on a list provided
    dbOGAP = "https://wangj27.u.hpc.mssm.edu/cgi-bin/DB_tb.cgi?textfield=%s&select=Any" % gene_listID[i]
    # Opens the URL as a page
    page = urllib2.urlopen(dbOGAP)
    # Reads the page and parses it through "lxml" format
    soup = BeautifulSoup(page, "lxml")
    gene_name = soup.find("td", text="Gene Name").find_next_sibling("td").text
    print gene_name[1:]
    gene_list.append(gene_name[1:])
    # PubMed IDs are located near the <td> tag with the term "Data and Source"
    pmid = soup.find("span", text="Data and Source")
    # Based on inspection of the website, need to move up to the parent <td> tag
    pmid_p = pmid.parent
    # Then we move to the next <td> tag, denoted as sibling (since they share the parent <tr> (table row) tag)
    pmid_s = pmid_p.next_sibling
    #for child in pmid_s.descendants:
    #    print child
    # Now we search down the tree to find the next table data (<td>) tag
    pmid_c = pmid_s.find("td")
    temp_lit = []
    # Next we print the text of the data
    #print pmid_c.text
    if "No literature is available" in pmid_c.text:
        temp_lit.append("No literature is available")
        print "Not available"
    else:
        # and then print out a list of urls for each pubmed ID we have
        print "The following is available"
        for link in pmid_c.find_all('a'):
            # the <a> tag includes more than just the link address.
            # for each <a> tag found, print the address (href attribute) and extra bits
            # link.string provides the string that appears to be hyperlinked.
            # In this case, it is the pubmedID
            print link.string
            temp_lit.append("PMID: " + link.string + " URL: " + link.get('href'))
    literature.append(temp_lit)
    print "\n"
So it seems the extra <div> elements are what throw the code for a loop. Is there a way to search for any element with the text "PMID" and return the text that comes after it (and the URL if there is a PMID number)? If not, would I just want to check each child, looking for the text I'm interested in?
I'm using Python 2.7.10
import requests
from bs4 import BeautifulSoup
import re

gene_listID = ["OG00894", "OG00980", "OG00769", "OG00834", "OG00852", "OG00131", "OG00020"]
urls = ('https://wangj27.u.hpc.mssm.edu/cgi-bin/DB_tb.cgi?textfield={}&select=Any'.format(i) for i in gene_listID)

for url in urls:
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    regex = re.compile(r'http://www.ncbi.nlm.nih.gov/pubmed/\d+')
    a_tag = soup.find('a', href=regex)
    has_pmid = 'PMID' in a_tag.previous_element
    if has_pmid:
        print(a_tag.text, a_tag.next_sibling, a_tag.get("href"))
    else:
        print("Not available")
out:
18984734 [GalNAz-Biotin tagging, CAD MS/MS]; http://www.ncbi.nlm.nih.gov/pubmed/18984734
20068230 [CAD, ETD MS/MS]; http://www.ncbi.nlm.nih.gov/pubmed/20068230
20068230 [CAD, ETD MS/MS]; http://www.ncbi.nlm.nih.gov/pubmed/20068230
Not available
16408927 [Azide-tag, nano-HPLC/tandem MS]; http://www.ncbi.nlm.nih.gov/pubmed/16408927
Not available
16408927 [Azide-tag, nano-HPLC/tandem MS] http://www.ncbi.nlm.nih.gov/pubmed/16408927?dopt=Citation
Find the first a tag that matches the target URL (which ends with numbers), then check if 'PMID' is in its previous element.
This site is very inconsistent; I tried many times. Hope this helps.
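If you specifically want to search by the literal text "PMID", as the question asks, here is a hedged alternative sketch using BeautifulSoup's text matching and next_elements (it assumes soup was built from the page as in the question):
import re
from bs4 import NavigableString

# find() with a regex on text= returns the matching NavigableString itself.
pmid_label = soup.find(text=re.compile(r'PMID'))
if pmid_label:
    # Walk forward through the parse tree and take the first text node that
    # contains digits -- that should be the PMID number itself, whether or
    # not it is wrapped in an <a> tag.
    for node in pmid_label.next_elements:
        if isinstance(node, NavigableString) and re.search(r'\d+', node):
            print(node.strip())
            break
else:
    print("Not available")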

BeautifulSoup to scrape street address

I am using the code at the far bottom to get the weblink and the Masjid name. However, I would like to also get the denomination and street address. Please help, I am stuck.
Currently I am getting the following:
Weblink:
<div class="subtitleLink"><a href="http://www.salatomatic.com/d/Tempe+5313+Masjid-Al-Hijrah">
and Masjid name
<b>Masjid Al-Hijrah</b>
But I would like to get the below:
Denomination
<b>Denomination:</b> Sunni (Traditional)
and street address
<br>45 Station Street (Sydney)
The below code scrapes the following
<td width=25><img src='http://www.halalfire.com/images/en/photo_small.jpg' alt='Masjid Al-Hijrah' title='Masjid Al-Hijrah' border=0 width=48 height=36></a></td><td width=10><img src="http://www.salatomatic.com/images/spacer.gif" width=10 border=0></td><td nowrap><div class="subtitleLink"><b>Masjid Al-Hijrah</b> </div><div class="tinyLink"><b>Denomination:</b> Sunni (Traditional)<br>45 Station Street (Sydney) </div></td><td align=right valign=center><div class="tinyLink"></div></td>
CODE:
from bs4 import BeautifulSoup
import urllib2

url1 = "http://www.salatomatic.com/c/Sydney+168"
content1 = urllib2.urlopen(url1).read()
soup = BeautifulSoup(content1)
results = soup.findAll("div", {"class" : "subtitleLink"})
for result in results:
    br = result.find('b')
    a = result.find('a')
    currenturl = a.get('href')
    if not currenturl.startswith("http"):
        currenturl = "http://www.salatomatic.com" + currenturl
        print currenturl
    elif currenturl.startswith("http"):
        print a.get('href')
    pos = br.get_text()
    print pos
You can check the next <div> element with a class attribute of value tinyLink that contains both <b> and <br> tags, and extract their strings:
...
print pos
div = result.find_next_sibling('div', attrs={"class": "tinyLink"})
if div and div.b and div.br:
    print(div.b.next_sibling.string)
    print(div.br.next_sibling.string)
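For the sample HTML in the question, div.b.next_sibling should be the text that follows the <b>Denomination:</b> tag (" Sunni (Traditional)"), and div.br.next_sibling the street address text ("45 Station Street (Sydney)").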
