Can't parse certain information applying conditional statement - python

I've written a script in Python in combination with Selenium to parse the emails of some companies out of a webpage. The problem is that the emails are either within span[data-mail] or span[data-mail-e-contact-mail]. If I try the two conditions separately, I can get all the emails. However, when I wrap them in a try:except:else block, they no longer work. Where am I going wrong?
website link
This is the script:
from selenium import webdriver
from bs4 import BeautifulSoup

url = "replace with the link above"

driver = webdriver.Chrome()
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')

for links in soup.select("article.vcard"):
    try:  # the following works when tried individually
        email = links.select_one(".hit-footer-wrapper span[data-mail]").get("data-mail")
    except:  # the following works as well when tried individually
        email = links.select_one(".hit-footer-wrapper span[data-mail-e-contact-mail]").get("data-mail-e-contact-mail")
    else:
        email = ""
    print(email)

driver.quit()
When I execute the above script, it prints nothing. Both selectors work when I print them individually, though.

Note that no exception is raised by your code, as both get("data-mail") and get("data-mail-e-contact-mail") return a value (empty or not) rather than raising one. Since the try block succeeds, the else block always runs and overwrites email with an empty string.
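Here is a minimal, self-contained sketch (with made-up values, not from your page) showing why: the else clause of a try statement runs only when the try block raises no exception, so it clobbers email every time.

try:
    email = "someone@example.com"  # succeeds, no exception raised
except AttributeError:
    email = "fallback@example.com"
else:
    email = ""  # runs because the try block succeeded, overwriting the email

print(email)  # prints an empty string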
Try the code below to get the required output:
for links in soup.select("article.vcard"):
    email = (links.select_one(".hit-footer-wrapper span[data-mail]").get("data-mail")
             or links.select_one(".hit-footer-wrapper span[data-mail-e-contact-mail]").get("data-mail-e-contact-mail"))
    print(email)
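If one of the spans can be missing entirely, select_one returns None and the chained .get() raises AttributeError. A guarded variant of the same idea (same selectors, just checked for None; this is a sketch, not part of the answer above):

for links in soup.select("article.vcard"):
    email = ""
    # Check each candidate span and stop at the first one that carries an address.
    for selector, attr in [
        (".hit-footer-wrapper span[data-mail]", "data-mail"),
        (".hit-footer-wrapper span[data-mail-e-contact-mail]", "data-mail-e-contact-mail"),
    ]:
        node = links.select_one(selector)
        if node and node.get(attr):
            email = node.get(attr)
            break
    print(email)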

Related

For Loops while using selenium for webscraping Python

I am attempting to web-scrape info off of the following website: https://www.axial.net/forum/companies/united-states-family-offices/
I am trying to scrape the description for each family office, so "https://www.axial.net/forum/companies/united-states-family-offices/" + insert_company_name are the pages I need to scrape.
So I wrote the following code to test the program for just one page:
from bs4 import BeautifulSoup as soup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Chrome('insert_path_here/chromedriver')
driver.get("https://network.axial.net/company/ansaco-llp")
page_source = driver.page_source
soup2 = soup(page_source,"html.parser")
soup2.findAll('axl-teaser-description')[0].text
This works for the single page, as long as the description doesn't have a "show full description" drop down button. I will save that for another question.
I wrote the following loop:
# Note: lst2 has all the names for the companies. I made sure they match the webpage
lst3 = []
for key in lst2[1:]:
    driver.get("https://network.axial.net/company/" + key.lower())
    page_source = driver.page_source
    for handle in driver.window_handles:
        driver.switch_to.window(handle)
    word_soup = soup(page_source, "html.parser")
    if word_soup.findAll('axl-teaser-description') == []:
        lst3.append('null')
    else:
        c = word_soup.findAll('axl-teaser-description')[0].text
        lst3.append(c)
print(lst3)
When I run the loop, all of the values come out as "null", even the ones without "click for full description" buttons.
I edited the loop to instead print out "word_soup", and the page is different than when I run it without a loop and does not contain the description text.
I don't understand why a loop would cause that but apparently it does. Does anyone know how to fix this problem?
Found a solution: pause the program for 3 seconds after driver.get:
import time

lst3 = []
for key in lst2[1:]:
    driver.get("https://network.axial.net/company/" + key.lower())
    time.sleep(3)
    page_source = driver.page_source
    word_soup = soup(page_source, "html.parser")
    if word_soup.findAll('axl-teaser-description') == []:
        lst3.append('null')
    else:
        c = word_soup.findAll('axl-teaser-description')[0].text
        lst3.append(c)
print(lst3)
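Instead of a fixed three-second pause, an explicit wait could be used so each page only blocks until the element actually appears. A sketch of that idea, assuming the description is rendered inside an axl-teaser-description tag as in the code above:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

lst3 = []
for key in lst2[1:]:
    driver.get("https://network.axial.net/company/" + key.lower())
    try:
        # Wait up to 10 seconds for the JS-rendered description element to appear.
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.TAG_NAME, "axl-teaser-description"))
        )
    except Exception:
        lst3.append('null')
        continue
    word_soup = soup(driver.page_source, "html.parser")
    lst3.append(word_soup.find('axl-teaser-description').text)
print(lst3)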
I see that the page uses JavaScript to generate the text, meaning it doesn't show up in the page source, which is weird but OK. I don't quite understand why you're iterating through and switching to all the windows Selenium has open, but you definitely won't find the description in the page source / BeautifulSoup.
Honestly, I'd personally look for a better website if you can; otherwise you'll have to do it with Selenium, which is inefficient and horrible.

How to check if URL contains something

I'm making a program which goes to a url, clicks a button, checks if the page gets forwarded and if it does saves that url to a file.
However, after a couple of entries the page blocks you from doing anything. When this happens the URL changes and you'll get this: Block.aspx?c=475412
Now how would I be able to check if the URL contains Block.aspx?c=475412 after each try?
I've tried looking for this but I could only find people asking how to get the current URL, not what I'm looking for, I need to check what the url contains.
Here is my code.
import selenium
from selenium import webdriver

url_list = open("path")

try:
    driver = webdriver.Chrome("C:\\python\\chromedriver")
    for url in url_list:
        driver.get(url)
        send = driver.find_element_by_id("NextButton")
        send.click()
        if driver.find_elements_by_css_selector("a[class='Error']"):
            print("Error class found")
except ValueError:
    print("Something went wrong checking the URL.")
I suppose I'd add an if statement checking if the URL contains Block.aspx?c=475412, if anyone would be able to help me out I'd greatly appreciate it.
If you want to check what the URL contains, you can just use the in operator built into Python strings.
if "Block.aspx?c=475412" in driver.current_url: # check if "Block.aspx?c=475412" is in URL
print("Block.aspx is in the URL")

Cycle through URLs from a txt

This is my first question so please bear with me (I have googled this and I did not find anything)
I'm making a program which goes to a url, clicks a button, checks if the page gets forwarded and if it does saves that url to a file.
So far I've got the first two steps done but I'm having some issues.
I want Selenium to repeat this process with multiple urls (if possible, multiple at a time).
I have all the urls in a txt called output.txt
At first I did
url_list = "https://example.com"
to see if my program even worked, and it did however I am stuck on how to get it to go to the next URL in the list and I am unable to find anything on the internet which helps me.
This is my code so far
import selenium
from selenium import webdriver

url_list = "C\\user\\python\\output.txt"

def site():
    driver = webdriver.Chrome("C:\\python\\chromedriver")
    driver.get(url_list)
    send = driver.find_element_by_id("NextButton")
    send.click()
    if driver.find_elements_by_css_selector("a[class='Error']"):
        print("Error class found")
I have no idea as to how I'd get selenium to go to the first url in the list then go onto the second one and so forth.
If anyone would be able to help me I'd be very grateful.
I think the problem is that you are treating the name of the file containing the URLs as if it were a URL itself. You need to open the file first and build the URL list.
According to the docs, https://selenium.dev/documentation/en/webdriver/browser_manipulation/, get expects a URL, not a file path.
import selenium
from selenium import webdriver

with open("C\\user\\python\\output.txt") as f:
    url_list = f.read().split('\n')

def site():
    driver = webdriver.Chrome("C:\\python\\chromedriver")
    for url in url_list:
        driver.get(url)
        send = driver.find_element_by_id("NextButton")
        send.click()
        if driver.find_elements_by_css_selector("a[class='Error']"):
            print("Error class found")

"How to fix 'malformed URL' in Selenium web scraping

My problem is that I am attempting to scrape the titles of Netflix movies and shows from a website that lists them on 146 different pages, so I made a loop to try and capture data from all the pages. However, when using the loop it makes my URL malformed and I don't know how to fix it.
I have made sure the webdriver part of the code works, meaning if I type the URL into driver.get it gives me the information I need. However, when using the loop it pops up multiple Firefox windows and doesn't put any URL into any of the windows. I also added a time delay to see if it was changing the URL before it got used, but it still didn't work.
from selenium import webdriver
import time

for i in range(1, 3):
    URL = "https://flixable.com/?min-rating=0&min-year=1920&max-year=2019&order=date&page={}"
    newURL = URL.format(i)
    print(newURL)
    time.sleep(10)
    driver = webdriver.Firefox()
    driver.get('newURL')
    titles = driver.find_elements_by_css_selector('#filterContainer > div > div > p > strong > a')
    for post in titles:
        print(post.text)
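The likely culprit is that driver.get('newURL') passes the literal string 'newURL' instead of the variable newURL, so no real URL ever reaches the browser. A minimal sketch of the fix (creating the driver once outside the loop is my own addition, not from the question):

from selenium import webdriver

driver = webdriver.Firefox()  # create one browser and reuse it for every page
URL = "https://flixable.com/?min-rating=0&min-year=1920&max-year=2019&order=date&page={}"

for i in range(1, 147):  # the site is described as having 146 pages
    newURL = URL.format(i)
    driver.get(newURL)  # pass the variable, not the string 'newURL'
    titles = driver.find_elements_by_css_selector('#filterContainer > div > div > p > strong > a')
    for post in titles:
        print(post.text)

driver.quit()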

Python: Why Selenium is not scraping the last webpage in the loop with Regex?

I am building a simple Selenium scraper. It should check for the existence of a "contact" link and then, if it exists, parse that page for emails with a regex. If not, it should parse the very same page on which Selenium lands.
The problem is that while the program gets the available emails for the first three (randomly chosen) websites, for the last one it not only fails to scrape the page for emails, but also does not even close the browser. However, the loop seems to come to an end anyway, as the output is "success". What am I doing wrong and why is it not scraping the last page in the dicti_pretty_links list? Code and output below:
import re
from selenium import webdriver
from bs4 import BeautifulSoup
import time, random

global scrapedEmails
scrapedEmails = []

# dicti_pretty_links = ['http://ayuda.ticketea.com/en/contact-us/','https://www.youtube.com/t/contact_us','http://www.haysplc.com/','http://madrid.usembassy.gov']
# http://www.iberia.com, http://madrid.usembassy.gov
dicti_pretty_links = ['http://www.haysplc.com/','https://www.youtube.com/t/contact_us','http://madrid.usembassy.gov','http://ayuda.ticketea.com/en/contact-us/',]

for el in dicti_pretty_links:  # This converts page into Selenium object
    browser = webdriver.Firefox()
    page = browser.get(el)
    time.sleep(random.uniform(0.5, 1.5))

    try:  # Tries to open "contact" link
        contact_link = browser.find_element_by_partial_link_text('ontact')
        if contact_link:
            contact_link.click()
    except:
        continue

    html = browser.page_source  # Loads up the page for Regex search
    soup = BeautifulSoup(html, 'lxml')
    time.sleep(random.uniform(0.5, 1.5))

    emailRegex = re.compile(r'([a-zA-Z0-9_.+]+#[a-zA-Z0-9_.+.+]+)', re.VERBOSE)
    mo = emailRegex.findall(html)
    print('THIS BELOW IS SEL_emails_MO for', el)
    print(mo)

    for el in mo:
        if el not in scrapedEmails:  # Checks if email is in / adds to ddbb
            scrapedEmails.append(el)

    browser.close()

print(100*'-')
print('This below is scrappedEmails list')
print(scrapedEmails)
And this is the output of running the program above:
C:\Users\SK\AppData\Local\Programs\Python\Python35-32\python.exe C:/Users/SK/PycharmProjects/untitled/temperase
THIS BELOW IS SEL_emails_MO for http://www.haysplc.com/
['customerservice#hays.com', 'customerservice#hays.com', 'ir#hays.com', 'ir#hays.com', 'cosec#hays.com', 'cosec#hays.com', 'hays#team365.co.uk', 'hays#team365.co.uk']
THIS BELOW IS SEL_emails_MO for https://www.youtube.com/t/contact_us
['press#youtube.com.']
THIS BELOW IS SEL_emails_MO for http://madrid.usembassy.gov
['visasmadrid#state.gov', 'visasmadrid#state.gov', 'visasmadrid#state.gov', 'ivmadrid#state.gov', 'ivmadrid#state.gov', 'ivmadrid#state.gov', 'askACS#state.gov', 'askacs#state.gov', 'askACS#state.gov']
----------------------------------------------------------------------------------------------------
This below is scrappedEmails list
['customerservice#hays.com', 'ir#hays.com', 'cosec#hays.com', 'hays#team365.co.uk', 'press#youtube.com.', 'visasmadrid#state.gov', 'ivmadrid#state.gov', 'askACS#state.gov', 'askacs#state.gov']
Process finished with exit code 0
The problem is that on the http://ayuda.ticketea.com/en/contact-us/ page, there is no link (a element) with "ontact" partial link text. The browser.find_element_by_partial_link_text() call fails with a NoSuchElementException and the loop continues.
If you don't want to skip to the next iteration when no link is found, but instead want to search the current page for email addresses, handle the exception but don't continue the loop:
try:
    contact_link = browser.find_element_by_partial_link_text('ontact')
    if contact_link:
        contact_link.click()
except:
    print("No Contact link found")
