I am trying to connect with elements that carry the contact numbers on each site. I was able to create the routine to get the numbers, extract the contact number with available formats and regex and the following code snippet to get the element
contact_elem = browser.find_elements_by_xpath("//*[contains(text(), '" + phone_num + "')]")
Considering the example of https://www.cssfirm.com/, the contact number appears in 2 locations, the top header and the bottom footer
The element texts accompanying the contact number are as follows :
<h3>CALL US TODAY AT (855) 910-7824</h3> - Footer
<span>Call Us<br>Today</span> (855) 910-7824 - Header
The extracted phone number matches perfectly while printing it out. For some reason, the element from the header part is not being detected.
I tried by searching for elements and even by deleting the footer element from the browser before executing the rest of the code.
What could be the reason for it to go undetected?
P.S: Below is the amateurish,uncorrected code. Efficiency edits/suggestions are welcome. The same code has been tested with various sites and works fine.
url = 'http://www.cssfirm.com/'
browser.get(url)
parsed = browser.find_element_by_tag_name('html').get_attribute('innerHTML')
s = BeautifulSoup(parsed, 'html.parser')
s = s.decode('utf-8')
phoneNumberRegex = '(\s*(?:\+?(\d{1,4}))?[-. (]*(\d{1,})[-. )]*(\d{3}|[A-Z0-9]+)[-. \/]*(\d{4}|[A-Z0-9]+)[-. \/]?(\d{4}|[A-Z0-9]+)?(?: *x(\d+))?\s*)'
custom_re = ['([0-9]{4,4} )([0-9]{3,3} )([0-9]{4,4})',
'([0-9]{3,3} )([0-9]{4,4} )([0-9]{4,4})',
'(\+[0-9]{2,2}-)([0-9]{4,4}-)([0-9]{4,4}-)(0)',
'(\([0-9]{3,3}\) )([0-9]{3,3}-)([0-9]{4,4})',
'(\+[0-9]{2,2} )(\(0\)[0-9]{4,4} )([0-9]{4,6})',
'([0-9]{5,5} )([0-9]{6,6})',
'(\+[0-9]{2,2}\(0\))([0-9]{4,4} )([0-9]{4,4})',
'(\+[0-9]{2,2} )([0-9]{3,3} )([0-9]{4,4} )([0-9]{3,3})',
'([0-9]{3,3}-)([0-9]{3,3}-)([0-9]{4,4})']
phones = []
phones = re.findall(phoneNumberRegex, s)
phone_num_list = ()
phone_num = ''
matched = 0
for phoneHeader in phones:
#phoneHeader = phoneHeader.decode('utf-8')
for ph_cnd in phoneHeader:
for pttrn in custom_re:
phones = re.findall(pttrn,ph_cnd)
if(phones):
phone_num_list = phones
for x in phone_num_list:
phone_num = ''.join(x)
try:
contact_elem = browser.find_element_by_xpath("//*[contains(text(), '" + phone_num + "')]")
phone_num_txt = contact_elem.text
if(phone_num_txt):
matched = 1
break
except NoSuchElementException:
pass
if(matched == 1):
break
if(matched == 1):
break
if(matched == 1):
break
print("Phone number :",phone_num) <-- Perfect output
contact_elem <--empty for header or just the footer element
EDIT
Code updated. Forgot an important piece. Moreover, there is sleep time given in between to give time for the page to load. Considering it trivial, I haven't included them for a quick read.
I found a temporary solution by searching for the partial link text, as the number also comes on the link.
contact_elem2 = browser.find_element_by_partial_link_text(phone_num)
However, this does not answer the generic question as to why that text was ignored within the element.
Related
I am trying to scrape data from a word document available at:-
https://dl.dropbox.com/s/pj82qrctzkw9137/HE%20Distributors.docx
I need to scrape the Name, Address, City, State, and Email ID. I am able to scrape the E-mail using the below code.
import docx
content = docx.Document('HE Distributors.docx')
location = []
for i in range(len(content.paragraphs)):
stat = content.paragraphs[i].text
if 'Email' in stat:
location.append(i)
for i in location:
print(content.paragraphs[i].text)
I tried to use the steps mentioned:
How to read data from .docx file in python pandas?
I need to convert this into a data frame with all the columns mentioned above.
Still facing issues with the same.
There are some inconsistencies in the document - phone numbers starting with Tel: sometimes, and Tel.: other times, and even Te: once, and I noticed one of the emails is just in the last line for that distributor without the Email: prefix, and the State isn't always in the last line.... Still, for the most part, most of the data can be extracted with regex and/or splits.
The distributors are separated by empty lines, and the names are in a different color - so I defined this function to get the font color of any paragraph from its xml:
# from bs4 import BeautifulSoup
def getParaColor(para):
try:
return BeautifulSoup(
para.paragraph_format.element.xml, 'xml'
).find('color').get('w:val')
except:
return ''
The try...except hasn't been necessary yet, but just in case...
(The xml is actually also helpful for double-checking that .text hasn't missed anything - in my case, I noticed that the email for Shri Adhya Educational Books wasn't getting extracted.)
Then, you can process the paragraphs from docx.Document with a function like:
# import re
def splitParas(paras):
ptc = [(
p.text, getParaColor(p), p.paragraph_format.element.xml
) for p in paras]
curSectn = 'UNKNOWN'
splitBlox = [{}]
for pt, pc, px in ptc:
# double-check for missing text
xmlText = BeautifulSoup(px, 'xml').text
xmlText = ' '.join([s for s in xmlText.split() if s != ''])
if len(xmlText) > len(pt): pt = xmlText
# initiate
if not pt:
if splitBlox[-1] != {}:
splitBlox.append({})
continue
if pc == '20752E':
curSectn = pt.strip()
continue
if splitBlox[-1] == {}:
splitBlox[-1]['section'] = curSectn
splitBlox[-1]['raw'] = []
splitBlox[-1]['Name'] = []
splitBlox[-1]['address_raw'] = []
# collect
splitBlox[-1]['raw'].append(pt)
if pc == 'D12229':
splitBlox[-1]['Name'].append(pt)
elif re.search("^Te.*:.*", pt):
splitBlox[-1]['tel_raw'] = re.sub("^Te.*:", '', pt).strip()
elif re.search("^Mob.*:.*", pt):
splitBlox[-1]['mobile_raw'] = re.sub("^Mob.*:", '', pt).strip()
elif pt.startswith('Email:') or re.search(".*[#].*[.].*", pt):
splitBlox[-1]['Email'] = pt.replace('Email:', '').strip()
else:
splitBlox[-1]['address_raw'].append(pt)
# some cleanup
if splitBlox[-1] == {}: splitBlox = splitBlox[:-1]
for i in range(len(splitBlox)):
addrsParas = splitBlox[i]['address_raw'] # for later
# join lists into strings
splitBlox[i]['Name'] = ' '.join(splitBlox[i]['Name'])
for k in ['raw', 'address_raw']:
splitBlox[i][k] = '\n'.join(splitBlox[i][k])
# search address for City, State and PostCode
apLast = addrsParas[-1].split(',')[-1]
maybeCity = [ap for ap in addrsParas if '–' in ap]
if '–' not in apLast:
splitBlox[i]['State'] = apLast.strip()
if maybeCity:
maybePIN = maybeCity[-1].split('–')[-1].split(',')[0]
maybeCity = maybeCity[-1].split('–')[0].split(',')[-1]
splitBlox[i]['City'] = maybeCity.strip()
splitBlox[i]['PostCode'] = maybePIN.strip()
# add mobile to tel
if 'mobile_raw' in splitBlox[i]:
if 'tel_raw' not in splitBlox[i]:
splitBlox[i]['tel_raw'] = splitBlox[i]['mobile_raw']
else:
splitBlox[i]['tel_raw'] += (', ' + splitBlox[i]['mobile_raw'])
del splitBlox[i]['mobile_raw']
# split tel [as needed]
if 'tel_raw' in splitBlox[i]:
tel_i = [t.strip() for t in splitBlox[i]['tel_raw'].split(',')]
telNum = []
for t in range(len(tel_i)):
if '/' in tel_i[t]:
tns = [t.strip() for t in tel_i[t].split('/')]
tel1 = tns[0]
telNum.append(tel1)
for tn in tns[1:]:
telNum.append(tel1[:-1*len(tn)]+tn)
else:
telNum.append(tel_i[t])
splitBlox[i]['Tel_1'] = telNum[0]
splitBlox[i]['Tel'] = telNum[0] if len(telNum) == 1 else telNum
return splitBlox
(Since I was getting font color anyway, I decided to add another
column called "section" to put East/West/etc in. And I added "PostCode" too, since it seems to be on the other side of "City"...)
Since "raw" is saved, any other value can be double checked manually at least.
The function combines "Mobile" into "Tel" even though they're extracted with separate regex.
I'd say "Tel_1" is fairly reliable, but some of the inconsistent patterns mean that other numbers in "Tel" might come out incorrect if they were separated with '/'.
Also, "Tel" is either a string or a list of strings depending on how many numbers there were in "tel_raw".
After this, you can just view as DataFrame with:
#import docx
#import pandas
content = docx.Document('HE Distributors.docx')
# pandas.DataFrame(splitParas(content.paragraphs)) # <--all Columns
pandas.DataFrame(splitParas(content.paragraphs))[[
'section', 'Name', 'address_raw', 'City',
'PostCode', 'State', 'Email', 'Tel_1', 'tel_raw'
]]
Blockquote
I am working on a project where I need to scrape data from a graph that shows the data for 1 day. For example: I want to get al the data for 2017 so I have to enter a new date in the datepicker 365 times. The problem is that although I am very specific in my XPATH call, the script finds way to many webelements. Many of which are not even compliant to my restrictions in my XPATH. This only happens some way into the loop and every loop the script finds more and more web elements.
The code that I am using:
Date_vec = pd.date_range(start="2017-01-01",end="2021-2-28")
DatePicker = web.find_element_by_xpath('/html/body/form/table/tbody/tr/td/table/tbody/tr/td[2]/div/div[1]/div[2]/div[2]/div[2]/div/div[2]/div/table/tbody/tr[2]/td/table/tbody/tr/td[2]/span/input')
month_prev = 0
year_prev = 0
for i in Date_vec:
DatePicker = web.find_element_by_xpath('/html/body/form/table/tbody/tr/td/table/tbody/tr/td[2]/div/div[1]/div[2]/div[2]/div[2]/div/div[2]/div/table/tbody/tr[2]/td/table/tbody/tr/td[2]/span/input')
DatePicker.click()
if i.year != year_prev:
# Year_button = web.find_element(By.XPATH,"//span[#onclick = 'basicDatePicker.ehYearSelectorClick(this)']")
Year_button = web.find_elements(By.XPATH,".//span[#onclick = 'basicDatePicker.ehYearSelectorClick(this)']")
Year_button[-1].click()
Year_choice = web.find_elements(By.XPATH,"//a[normalize-space(text()) ='"+ str(i.year) + "']")
Year_choice[-1].click()
elif i.month != month_prev:
Month_button = web.find_elements(By.XPATH,"//span[#onclick = 'basicDatePicker.ehMonthSelectorClick(this)']")
Month_button[-1].click()
Month_choice = web.find_elements_by_class_name('bdpMonthItem')
Month_choice[i.month-1].click()
Day_button = web.find_elements(By.XPATH,"//a[normalize-space(text()) ='"+ str(i.day) + "' and contains(#class, 'bdpDay')]")
Day_button[-1].click()
time.sleep(3)
month_prev = i.month
year_prev = i.year
For example, the problem arises at the line below:
Day_button = web.find_elements(By.XPATH,"//a[normalize-space(text()) ='"+ str(i.day) + "' and contains(#class, 'bdpDay')]")
This line returns 4 elements of which 2 don't have any text in them. I checked this with the following line.
test1 = [i.text for i in Day_button]
So my question basically is: Why does the line code return 4 elements of which two don't have text while I explicitly tell it to have the current day as text. Any help is appreciated.
edit: for clarity, I added a snip from the datepicker in question:
I am trying to open up several URL's (because they contain data I want to append to a list). I have a logic saying "if amount in icl_dollar_amount_l" then run the rest of the code. However, I want the script to only run the rest of the code on that specific amount in the variable "amount".
Example:
selenium opens up X amount of links and sees ['144,827.95', '5,199,024.87', '130,710.67'] in icl_dollar_amount_l but i want it to skip '144,827.95', '5,199,024.87' and only get the information for '130,710.67' which is in the 'amount' variable already.
Actual results:
Its getting webscaping information for amount '144,827.95' only and not even going to '5,199,024.87', '130,710.67'. I only want it getting webscaping information for '130,710.67' because my amount variable has this as the only amount.
print(icl_dollar_amount_l)
['144,827.95', '5,199,024.87', '130,710.67']
print(amount)
'130,710.67'
file2.py
def scrapeBOAWebsite(url,fcg_subject_l, gp_subject_l):
from ICL_Awk_Checker import rps_amount_l2
icl_dollar_amount_l = []
amount_ack_missing_l = []
file_total_l = []
body_l = []
for link in url:
print(link)
browser = webdriver.Chrome(options=options,
executable_path=r'\\TEST\user$\TEST\Documents\driver\chromedriver.exe')
# if 'P2 Cust ID 908554 File' in fcg_subject:
browser.get(link)
username = browser.find_element_by_name("dialog:username").get_attribute('value')
submit = browser.find_element_by_xpath("//*[#id='dialog:continueButton']").click()
body = browser.find_element_by_xpath("//*[contains(text(), 'Total:')]").text
body_l.append(body)
icl_dollar_amount = re.findall('(?:[\£\$\€]{1}[,\d]+.?\d*)', body)[0].split('$', 1)[1]
icl_dollar_amount_l.append(icl_dollar_amount)
if not missing_amount:
logging.info("List is empty")
print("List is empty")
count = 0
for amount in missing_amount:
if amount in icl_dollar_amount_l:
body = body_l[count]
get_file_total = re.findall('(?:[\£\$\€]{1}[,\d]+.?\d*)', body)[0].split('$', 1)[1]
file_total_l.append(get_file_total)
return icl_dollar_amount_l, file_date_l, company_id_l, client_id_l, customer_name_l, file_name_l, file_total_l, \
item_count_l, file_status_l, amount_ack_missing_l
I don't know if I understand problem but this
if amount in icl_dollar_amount_l:
doesn't give information on which position is '130,710.67' in icl_dollar_amount_l and you need also
count = icl_dollar_amount_l.index(amount)
for amount in missing_amount:
if amount in icl_dollar_amount_l:
count = icl_dollar_amount_l.index(amount)
body = body_l[count]
But it will works if you expect only one amount on list icl_dollar_amount_l. For more elements you would have to use rather for-loop and check every element separatelly
for amount in missing_amount:
for count, item in enumerate(icl_dollar_amount_l)
if amount == item :
body = body_l[count]
But frankly I don't know why you don't check it in first loop for link in url: when you have direct access to icl_dollar_amount and body
I am trying to scrape all the different variations of this webpage.For instance the code that should scrape this webpage http://www.virginiaequestrian.com/main.cfm?action=greenpages&sub=view&ID=11849.
should be the same as the code i use to scrape this webpage
http://www.virginiaequestrian.com/main.cfm?action=greenpages&sub=view&ID=11849
def extract_contact(url):
r=requests.get(url)
soup=BeautifulSoup(r.content,'lxml')
tbl=soup.findAll('table')[2]
list=[]
Contact=tbl.findAll('p')[0]
for br in Contact.findAll('br'):
next = br.nextSibling
if not (next and isinstance(next,NavigableString)):
continue
next2 = next.nextSibling
if next2 and isinstance(next2,Tag) and next2.name == 'br':
text = re.sub(r'[\n\r\t\xa0]','',next).replace('Phone:','').strip()
list.append(text)
print list
#Street=list.pop(0)
#CityStateZip=list.pop(0)
#Phone=list.pop(0)
#City,StateZip= CityStateZip.split(',')
#State,Zip= StateZip.split(' ')
#ContactName = Contact.findAll('b')[1]
#ContactEmail = Contact.findAll('a')[1]
#Body=tbl.findAll('p')[1]
#Website = Contact.findAll('a')[2]
#Email = ContactEmail.text.strip()
#ContactName = ContactName.text.strip()
#Website = Website.text.strip()
#Body = Body.text
#Body = re.sub(r'[\n\r\t\xa0]','',Body).strip()
#list.extend([Street,City,State,Zip,ContactName,Phone,Email,Website,Body])
return list
The way i believe i will need to write the code in order it to work, is to set it up so that print list returns the same number of values, ordered identically.Currently, the above script returns these values
[u'2133 Craigs Store Road', u'Afton,VA 22920', u'434-882-3150']
[u'Alexandria,VA 22305']
Accounting for missing values,in order to be able to parse this page consistently,
I need the print list command to return something similar to this
[u'2133 Craigs Store Road', u'Afton,VA 22920', u'434-882-3150']
['',u'Alexandria,VA 22305','']
this way i will be able to manipulate values by position(as they will be in consistent order). The problem is that i don't know how to accomplish this as I am still very new to parsing. If anybody has any insight as to how to solve the problem i would be highly appreciative.
def extract_contact(url):
r=requests.get(url)
soup=BeautifulSoup(r.content,'lxml')
tbl=soup.findAll('table')[2]
list=[]
Contact=tbl.findAll('p')[0]
for br in Contact.findAll('br'):
next = br.nextSibling
if not (next and isinstance(next,NavigableString)):
continue
next2 = next.nextSibling
if next2 and isinstance(next2,Tag) and next2.name == 'br':
text = re.sub(r'[\n\r\t\xa0]','',next).replace('Phone:','').strip()
list.append(text)
Street=[s for s in list if ',' not in s and '-' not in s]
CityStateZip=[s for s in list if ',' in s]
Phone = [s for s in list if '-' in s]
if not Street:
Street=''
else:
Street=Street[0]
if not CityStateZip:
CityStateZip=''
else:
City,StateZip= CityStateZip[0].split(',')
State,Zip= StateZip.split(' ')
if not Phone:
Phone=''
else:
Phone=Phone[0]
list=[]
I figured out an alternative solution using substrings and if statements. Since there are only 3 values max in the list, all with defining characteristics i realized that i could delegate by looking for special characters rather than the position of the record.
So sorry about this vague and confusing title. But there is no really better way for me to summarize my problem in one sentence.
I was trying to get the student and grade information from a french website. The link is this (http://www.bankexam.fr/resultat/2014/BACCALAUREAT/AMIENS?filiere=BACS)
My code is as follows:
import time
import urllib2
from bs4 import BeautifulSoup
regions = {'R\xc3\xa9sultats Bac Amiens 2014':'/resultat/2014/BACCALAUREAT/AMIENS'}
base_url = 'http://www.bankexam.fr'
tests = {'es':'?filiere=BACES','s':'?filiere=BACS','l':'?filiere=BACL'}
for i in regions:
for x in tests:
# create the output file
output_file = open('/Users/student project/'+ i + '_' + x + '.txt','a')
time.sleep(2) #compassionate scraping
section_url = base_url + regions[i] + tests[x] #now goes to the x test page of region i
request = urllib2.Request(section_url)
response = urllib2.urlopen(request)
soup = BeautifulSoup(response,'html.parser')
content = soup.find('div',id='zone_res')
for row in content.find_all('tr'):
if row.td:
student = row.find_all('td')
name = student[0].strong.string.encode('utf8').strip()
try:
school = student[1].strong.string.encode('utf8')
except AttributeError:
school = 'NA'
result = student[2].span.string.encode('utf8')
output_file.write ('%s|%s|%s\n' % (name,school,result))
# Find the maximum pages to go through
if soup.find('div','pagination'):
import re
page_info = soup.find('div','pagination')
pages = []
for i in page_info.find_all('a',re.compile('elt')):
try:
pages.append(int(i.string.encode('utf8')))
except ValueError:
continue
max_page = max(pages)
# Now goes through page 2 to max page
for i in range(1,max_page):
page_url = '&p='+str(i)+'#anchor'
section2_url = section_url+page_url
request = urllib2.Request(section2_url)
response = urllib2.urlopen(request)
soup = BeautifulSoup(response,'html.parser')
content = soup.find('div',id='zone_res')
for row in content.find_all('tr'):
if row.td:
student = row.find_all('td')
name = student[0].strong.string.encode('utf8').strip()
try:
school = student[1].strong.string.encode('utf8')
except AttributeError:
school = 'NA'
result = student[2].span.string.encode('utf8')
output_file.write ('%s|%s|%s\n' % (name,school,result))
A little more description about the code:
I created a 'regions' dictionary and 'tests' dictionary because there are 30 other regions I need to collect and I just include one here for showcase. And I'm just interested in the test results of three tests (ES, S, L) and so I created this 'tests' dictionary.
Two errors keep showing up,
one is
KeyError: 2
and the error is linked to line 12,
section_url = base_url + regions[i] + tests[x]
The other is
TypeError: cannot concatenate 'str' and 'int' objects
and this is linked to line 10.
I know there is a lot of information here and I'm probably not listing the most important info for you to help me. But let me know how I can do to fix this!
Thanks
The issue is that you're using the variable i in more than one place.
Near the top of the file, you do:
for i in regions:
So, in some places i is expected to be a key into the regions dictionary.
The trouble comes when you use it again later. You do so in two places:
for i in page_info.find_all('a',re.compile('elt')):
And:
for i in range(1,max_page):
The second of these is what is causing your exceptions, as the integer values that get assigned to i don't appear in the regions dict (nor can an integer be added to a string).
I suggest renaming some or all of those variables. Give them meaningful names, if possible (i is perhaps acceptable for an "index" variable, but I'd avoid using it for anything else unless you're code golfing).