Python print results which contain a specific string - python

I am trying to get the description text of Google search results.
from selenium import webdriver
import re
chrome_path = r"C:\Users\xxxx\Downloads\Compressed\chromedriver_win32\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
driver.get("https://www.google.co.in/search?q=stackoverflow")
posts = driver.find_elements_by_class_name("st")
for post in posts:
    print(post.text)
This prints the descriptions correctly, but I only want to print the links that appear in each description.
I also want results from the first 5 Google search pages.
At the moment I only get results from page 1.
I have tried using
print(post.get_attribute('href'))
but the links in the descriptions are plain text, not clickable anchors, so this returns None.

Try the code below:
for i in range(1, 6):
    print("--------------------------------------------------------------------")
    print("Page " + str(i) + " Results : ")
    print("--------------------------------------------------------------------")
    # Links written as plain text inside the description
    staticLinks = driver.find_elements_by_xpath("//*[@class='st']")
    for desc in staticLinks:
        txt = desc.text
        if 'http://' in txt or 'https://' in txt:
            for c in txt.split():
                if c.startswith('http'):
                    print(c)
    # Clickable links (<a> elements) inside the description
    dynamicLinks = driver.find_elements_by_xpath("//*[@class='st']//a")
    for desc in dynamicLinks:
        link = desc.get_attribute('href')
        if link is not None:
            print(link)
    # Move on to the next result page, except after the fifth one
    if i < 5:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        nextPage = driver.find_element_by_xpath("//a[@aria-label='Page " + str(i + 1) + "']")
        nextPage.click()
This prints the static (plain-text) and dynamic (anchor) links found in the result descriptions on the first 5 Google search result pages.
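The word-splitting check for plain-text links can also be written with a regular expression. This is a minimal sketch rather than part of the original answer: `extract_urls` is a hypothetical helper, and the set of trailing punctuation it strips is an assumption about how descriptions are written.

```python
import re

# Matches http:// or https:// followed by any run of non-whitespace characters.
URL_PATTERN = re.compile(r"https?://\S+")


def extract_urls(text):
    """Return the URLs found in a plain-text description, in order of appearance."""
    urls = []
    for match in URL_PATTERN.findall(text):
        # A sentence often ends right after a link, so drop punctuation
        # that is unlikely to belong to the URL itself.
        urls.append(match.rstrip(".,;:!?)"))
    return urls
```

Inside the answer's loop it could be called as `for url in extract_urls(desc.text): print(url)`.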

Related

Getting text from multiple webpages(Pagination) in selenium python

I want to extract text from multiple pages. Currently I can extract data from the first page, but I want to move through the pagination and append the data from every page. I have written this simple code, which extracts data from the first page only. I cannot extract the data from the remaining pages, whose number varies.
element_list = []
opts = webdriver.ChromeOptions()
opts.headless = True
driver = webdriver.Chrome(ChromeDriverManager().install())
base_url = "XYZ"
driver.maximize_window()
driver.get(base_url)
driver.set_page_load_timeout(50)
element = WebDriverWait(driver, 50).until(EC.presence_of_element_located((By.ID, 'all-my-groups')))
l = []
l = driver.find_elements_by_xpath("//div[contains(@class, 'alias-wrapper sim-ellipsis sim-list--shortId')]")
for i in l:
    print(i.text)
I have shared images of the pagination element's class in case that helps.
It would be great if we could automate the extraction from all the pages. I am new to this, so please pardon any silly questions. Thanks in advance.
You have provided the code just for the previous page button. I guess you need to keep going to the next page for as long as a next page exists. As I don't know which site this is, I can only guess its behavior, so I'm assuming the 'Next' button disappears when there is no next page. If so, it can be done like this:
element_list = []
opts = webdriver.ChromeOptions()
opts.headless = True
driver = webdriver.Chrome(ChromeDriverManager().install())
base_url = "XYZ"
driver.maximize_window()
driver.get(base_url)
driver.set_page_load_timeout(50)
element = WebDriverWait(driver, 50).until(EC.presence_of_element_located((By.ID, 'all-my-groups')))
l = []
l = driver.find_elements_by_xpath("//div[contains(@class, 'alias-wrapper sim-ellipsis sim-list--shortId')]")
while True:
    try:
        next_page = driver.find_element(By.XPATH, '//button[@label="Next page"]')
    except NoSuchElementException:
        # no 'Next' button left: this was the last page
        break
    next_page.click()
    l.extend(driver.find_elements(By.XPATH, "//div[contains(@class, 'alias-wrapper sim-ellipsis sim-list--shortId')]"))
for i in l:
    print(i.text)
To catch the exception, add this import:
from selenium.common.exceptions import NoSuchElementException
Also note that the method find_elements_by_xpath is deprecated, so it is better to replace this line:
l = driver.find_elements_by_xpath("//div[contains(@class, 'alias-wrapper sim-ellipsis sim-list--shortId')]")
with this one:
l = driver.find_elements(By.XPATH, "//div[contains(@class, 'alias-wrapper sim-ellipsis sim-list--shortId')]")
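One thing to watch for when elements are collected across pages: references found on one page may become stale once the next page loads, so reading `.text` after the loop can fail. Below is a minimal sketch that reads the text while each page is still displayed. It assumes, as the answer does, that the Next button is absent on the last page. `collect_texts` and the `max_pages` safety cap are hypothetical, and locators are passed as `(by, value)` tuples such as `(By.XPATH, "...")`.

```python
def collect_texts(driver, item_locator, next_locator, max_pages=1000):
    """Collect the text of every matching element across paginated results.

    `driver` is a Selenium WebDriver (or any object with a compatible
    find_elements method). The text is read while its page is still
    displayed, so no element reference is used after navigation.
    """
    texts = []
    for _ in range(max_pages):  # safety cap (an assumption) against endless loops
        for element in driver.find_elements(*item_locator):
            texts.append(element.text)
        # find_elements returns an empty list when nothing matches, so no
        # exception handling is needed to detect the last page.
        next_buttons = driver.find_elements(*next_locator)
        if not next_buttons:
            break
        next_buttons[0].click()
    return texts
```

It could be called as `collect_texts(driver, (By.XPATH, "//div[contains(@class, 'sim-list--shortId')]"), (By.XPATH, '//button[@label="Next page"]'))`. Depending on the site, a wait after the click may still be needed before the next page's items are present.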

moving to the next page till the end with selenium in python

I want to retrieve the data from the link below. I can retrieve the first page, but I have a problem writing the loop that moves to the next page until the end. Could you help me complete my code?
My link is:
https://www150.statcan.gc.ca/n1/pub/71-607-x/2021004/exp-eng.htm?r1=(1)&r2=0&r3=0&r4=12&r5=0&r7=0&r8=2022-02-01&r9=2022-02-01
from selenium import webdriver
import time
url = "https://www150.statcan.gc.ca/n1/pub/71-607-x/2021004/exp-eng.htm?r1=(1)&r2=0&r3=0&r4=12&r5=0&r7=0&r8=2022-02-01&r9=2022-02-01"
driver = webdriver.Chrome(r"C:\Program Files\Python310\chromedriver.exe")
driver.get(url)
table = driver.find_element_by_id('report_table')
body = table.find_element_by_tag_name('tbody')
cells = body.find_elements_by_tag_name('td')
for cell in cells:
    print(cell.text)
It retrieves the first page's data, but I don't know how to retrieve the other pages.
Look for the next-page selector and iterate: as long as it is there, click it after your extraction step. You can do that in a while loop, for example, and break out of it when the selector can't be found.
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
url = "https://www150.statcan.gc.ca/n1/pub/71-607-x/2021004/exp-eng.htm?r1=(1)&r2=0&r3=0&r4=12&r5=0&r7=0&r8=2022-02-01&r9=2022-02-01"
driver = webdriver.Chrome(r"C:\Program Files\Python310\chromedriver.exe")
driver.get(url)
while True:
    # extract the cells of the page currently shown
    table = driver.find_element_by_id('report_table')
    body = table.find_element_by_tag_name('tbody')
    cells = body.find_elements_by_tag_name('td')
    for cell in cells:
        print(cell.text)
    # find_elements returns an empty list (instead of raising) when nothing matches
    next_page = driver.find_elements(By.XPATH, '//a[@id="report_results_next"]')
    if next_page:
        # click the link; driver.get() expects a URL string, not an element
        next_page[0].click()
        time.sleep(1)  # crude pause so the table can refresh
    else:
        # stop
        break
This is not tested.
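The loop relies on the Next link disappearing on the last page. Some paginated tables keep the link in the DOM and only mark it as disabled. Whether this site does so is an assumption that should be checked in the browser's inspector. A small, hypothetical helper for that case:

```python
def next_is_disabled(class_attr, aria_disabled=None):
    """Guess whether a 'Next' control is disabled from its attributes.

    class_attr    -- value of the element's class attribute (may be None)
    aria_disabled -- value of its aria-disabled attribute (may be None)
    The 'disabled' class name is an assumption; sites use different names.
    """
    classes = (class_attr or "").split()
    return "disabled" in classes or aria_disabled == "true"
```

Inside the loop it could be used as `if next_is_disabled(link.get_attribute("class"), link.get_attribute("aria-disabled")): break`, where `link` is the element found for the Next control.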

scrape a div with auto generated class with python selenium

Hello, I'm trying to scrape some questions from a web forum.
I am able to scrape the questions with
find_elements_by_xpath
like this:
questions = driver.find_elements_by_xpath('//div[@class="autu-generated"]//div[@class="corpus"]//div[@class="body-bd"]//p')
I made a diagram to illustrate my situation:
My problem is that if I don't specify the auto-generated class in the XPath, it returns the values from all the other divs as well (which I don't want),
and writing the auto-generated class by hand, as I did for testing, isn't a workable approach because I'm scraping multiple questions that each have a different class.
Do you have any ideas on how to resolve this problem?
Here is the web forum.
Thank you.
My code:
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
import time
from fastparquet.parquet_thrift.parquet.ttypes import TimeUnit
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import pandas as pd
driver = webdriver.Chrome('/Users/ossama/Downloads/chromedriver_win32/chromedriver')
page = 1
#looping in pages
while page <= 10:
    driver.get('https://forum.bouyguestelecom.fr/questions/browse?flow_state=published&order=created_at.desc&page='+str(page)+'&utf8=✓&search=&with_category%5B%5D=2483')
    # checking to click the pop-up cookies interfaces
    if page == 1:
        #waiting 10s for the pop-up to show up before accepting it
        time.sleep(10)
        driver.find_element_by_id('popin_tc_privacy_button_3').click()
        # store all the links in a list
        #question_links = driver.find_elements_by_xpath('//div[@class="corpus"]//a[@class="content_permalink"]')
        links = driver.find_elements_by_xpath('//div[@class="corpus"]//a[@class="content_permalink"]')
        forum_links= []
        for link in links:
            value = link.get_attribute("href")
            print(value)
            forum_links.append(value)
    else:
        links = driver.find_elements_by_xpath('//div[@class="corpus"]//a[@class="content_permalink"]')
        for link in links:
            value = link.get_attribute("href")
            print(value)
            forum_links.append(value)
    q_df = pd.DataFrame(forum_links)
    q_df.to_csv('forum_links.csv')
    page = page + 1
for link in forum_links:
    driver.get(link)
    #time.sleep(5)
    #driver.find_element_by_id('popin_tc_privacy_button_3').click()
    questions = driver.find_elements_by_xpath('//div[@class="corpus"]//div[@class="body-bd"]//p')
    authors = driver.find_elements_by_xpath('//div[@class="corpus"]//div[@class="metadata"]//dl[@class="author-name"]//dd//a')
    dates = driver.find_elements_by_xpath('//div[@class="corpus"]//div[@class="metadata"]//dl[@class="date"]//dd')
    questions_list = []
    for question in questions:
        for author in authors:
            for date in dates:
                questions_list.append([question.text, author.text, date.text])
                print(question.text)
                print(author.text)
                print(date.text)
    q_df = pd.DataFrame(questions_list)
    q_df.to_csv('colrow.csv')
I improved the XPath and removed the second loop.
page = 1
while page <= 10:
    driver.get(
        'https://forum.bouyguestelecom.fr/questions/browse?flow_state=published&order=created_at.desc&page=' + str(
            page) + '&utf8=✓&search=&with_category%5B%5D=2483')
    driver.maximize_window()
    print("Page url: " + driver.current_url)
    time.sleep(1)
    if page == 1:
        # accept the cookie pop-up, which only appears on the first page
        AcceptButton = driver.find_element(By.ID, 'popin_tc_privacy_button_3')
        AcceptButton.click()
    questions = driver.find_elements(By.XPATH, '//div[@class="corpus"]//a[@class="content_permalink"]')
    for count, item in enumerate(questions, start=1):
        print(str(count) + ": question detail:")
        # re-locate the link by position, because the page is reloaded after each visit
        questionfount = driver.find_element(
            By.XPATH, "(//div[@class='corpus']//a[@class='content_permalink'])[" + str(count) + "]")
        questionfount.click()
        questionInPage = WebDriverWait(driver, 20).until(EC.visibility_of_element_located(
            (By.XPATH, "(//p[@class='old-h1']//following::div[contains(@__uid__, "
                       "'dim')]//div[@class='corpus']//a["
                       "@class='content_permalink'])[1]")))
        author = WebDriverWait(driver, 20).until(EC.visibility_of_element_located(
            (By.XPATH, "(//p[@class='old-h1']//following::div[contains(@__uid__, 'dim')]//div["
                       "@class='corpus']//div[contains(@class, 'metadata')]//dl["
                       "@class='author-name']//a)[1]")))
        date = WebDriverWait(driver, 20).until(EC.visibility_of_element_located(
            (By.XPATH, "(//p[@class='old-h1']//following::div[contains(@__uid__, 'dim')]//div["
                       "@class='corpus']//div[contains(@class, 'metadata')]//dl[@class='date']//dd)[1]")))
        print(questionInPage.text)
        print(author.text)
        print(date.text)
        print(
            "-----------------------------------------------------------------------------------------------------------")
        driver.back()
        driver.refresh()
    page = page + 1
driver.quit()
Output (in Console):
Page url: https://forum.bouyguestelecom.fr/questions/browse?flow_state=published&order=created_at.desc&page=1&utf8=%E2%9C%93&search=&with_category%5B%5D=2483
1: question detail:
Comment annuler ma commande bbox
ELHADJI
17 novembre 2021
-----------------------------------------------------------------------------------------------------------
2: question detail:
BBOX adsl : Interruption Service Internet ?
GABRIELA
17 novembre 2021
-----------------------------------------------------------------------------------------------------------
To overcome this issue, I found that the div with the auto-generated class has a __uid__ attribute,
so here is what the XPath looks like now:
questions = driver.find_elements_by_xpath('//div[@__uid__="dim2"]//div[@class="corpus"]//div[@class="body-bd"]//p')
Sometimes we just have to look closely!
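A side note on the nested `for` loops in the question's code: they append one row for every combination of question, author and date, not one row per post. If the three lists line up by position (an assumption about the page layout), `zip` pairs them instead. A minimal sketch with plain strings standing in for the `.text` values; `build_rows` is a hypothetical name:

```python
def build_rows(questions, authors, dates):
    """Pair the i-th question with the i-th author and the i-th date."""
    # zip stops at the shortest list, so unequal lengths silently drop items.
    return [[q, a, d] for q, a, d in zip(questions, authors, dates)]


rows = build_rows(
    ["Comment annuler ma commande bbox", "BBOX adsl : Interruption Service Internet ?"],
    ["ELHADJI", "GABRIELA"],
    ["17 novembre 2021", "17 novembre 2021"],
)
for row in rows:
    print(row)
```

The resulting list of rows can be passed to `pd.DataFrame(...)` in the same way as `questions_list` in the question.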

navigating through pagination with selenium in python

I'm scraping this website using Python and Selenium. The code works, but it currently only scrapes the first page. I would like to iterate through all the pages and scrape each one, but the site handles pagination in a weird way. How would I go through the pages and scrape them one by one?
Pagination HTML:
<div class="pagination">
First
Prev
1
<span class="current">2</span>
3
4
Next
Last
</div>
Scraper:
import re
import json
import requests
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.chrome.options import Options
options = Options()
# options.add_argument('--headless')
options.add_argument("start-maximized")
options.add_argument('disable-infobars')
driver=webdriver.Chrome(chrome_options=options,
                        executable_path=r'/Users/weaabduljamac/Downloads/chromedriver')
url = 'https://services.wiltshire.gov.uk/PlanningGIS/LLPG/WeeklyList'
driver.get(url)
def getData():
    data = []
    rows = driver.find_element_by_xpath('//*[@id="form1"]/table/tbody').find_elements_by_tag_name('tr')
    for row in rows:
        app_number = row.find_elements_by_tag_name('td')[1].text
        address = row.find_elements_by_tag_name('td')[2].text
        proposals = row.find_elements_by_tag_name('td')[3].text
        status = row.find_elements_by_tag_name('td')[4].text
        data.append({"CaseRef": app_number, "address": address, "proposals": proposals, "status": status})
    print(data)
    return data
def main():
    all_data = []
    select = Select(driver.find_element_by_xpath("//select[@class='formitem' and @id='selWeek']"))
    list_options = select.options
    for item in range(len(list_options)):
        select = Select(driver.find_element_by_xpath("//select[@class='formitem' and @id='selWeek']"))
        select.select_by_index(str(item))
        driver.find_element_by_css_selector("input.formbutton#csbtnSearch").click()
        all_data.extend( getData() )
        driver.find_element_by_xpath('//*[@id="form1"]/div[3]/a[4]').click()
        driver.get(url)
    with open( 'wiltshire.json', 'w+' ) as f:
        json.dump( all_data, f )
    driver.quit()
if __name__ == "__main__":
    main()
Before automating any scenario, always write down the manual steps you would perform to execute it. The manual steps for what you want to do (as I understand it from the question) are:
1) Go to site - https://services.wiltshire.gov.uk/PlanningGIS/LLPG/WeeklyList
2) Select first week option
3) Click search
4) Get the data from every page
5) Load the url again
6) Select second week option
7) Click search
8) Get the data from every page
.. and so on.
You have a loop that selects the different weeks, but inside each iteration for a week option you also need a loop that iterates over all the pages. Since you are not doing that, your code returns only the data from the first page.
Another problem is how you are locating the 'Next' button:
driver.find_element_by_xpath('//*[@id="form1"]/div[3]/a[4]').click()
You are selecting the 4th <a> element, which is of course not robust, because the Next button's index differs from page to page. Instead, use this better locator:
driver.find_element_by_xpath("//a[contains(text(),'Next')]").click()
Logic for the loop that iterates through the pages:
First you need the number of pages. I got it by locating the <a> immediately before the "Next" button. As per the screenshot, the text of this element is equal to the number of pages.
I did that using the following code:
number_of_pages = int(driver.find_element_by_xpath("//a[contains(text(),'Next')]/preceding-sibling::a[1]").text)
Once you have the number of pages as number_of_pages, you only need to click the "Next" button number_of_pages - 1 times!
Final code for your main function:
import time
def main():
    all_data = []
    select = Select(driver.find_element_by_xpath("//select[@class='formitem' and @id='selWeek']"))
    list_options = select.options
    for item in range(len(list_options)):
        select = Select(driver.find_element_by_xpath("//select[@class='formitem' and @id='selWeek']"))
        select.select_by_index(str(item))
        driver.find_element_by_css_selector("input.formbutton#csbtnSearch").click()
        number_of_pages = int(driver.find_element_by_xpath("//a[contains(text(),'Next')]/preceding-sibling::a[1]").text)
        for j in range(number_of_pages):
            # scrape every page, including the last one
            all_data.extend(getData())
            # click "Next" only between pages, i.e. number_of_pages - 1 times
            if j < number_of_pages - 1:
                driver.find_element_by_xpath("//a[contains(text(),'Next')]").click()
                time.sleep(1)
        driver.get(url)
    with open( 'wiltshire.json', 'w+' ) as f:
        json.dump( all_data, f )
    driver.quit()
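The page loop can be isolated from Selenium to check the counting. Below is a sketch with two callables standing in for `getData()` and the click on 'Next' (`scrape_all_pages`, `get_data` and `click_next` are hypothetical names). Every page is scraped, and 'Next' is clicked only between pages.

```python
def scrape_all_pages(get_data, click_next, number_of_pages):
    """Scrape `number_of_pages` pages, clicking 'Next' number_of_pages - 1 times."""
    all_rows = []
    for page in range(number_of_pages):
        all_rows.extend(get_data())        # scrape the page currently shown
        if page < number_of_pages - 1:     # no click after the last page
            click_next()
    return all_rows
```

With Selenium, `get_data` would be the `getData` function from the question and `click_next` a small function that clicks the `//a[contains(text(),'Next')]` element and waits for the page to update.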
The following approach simply worked for me.
driver.find_element_by_link_text("3").click()
driver.find_element_by_link_text("4").click()
....
driver.find_element_by_link_text("Next").click()
First get the total number of pages in the pagination, using
ins.get('https://services.wiltshire.gov.uk/PlanningGIS/LLPG/WeeklyList/10702380,1')
ins.find_element_by_class_name("pagination")
source = BeautifulSoup(ins.page_source, 'html.parser')
div = source.find_all('div', {'class':'pagination'})
all_as = div[0].find_all('a')
total = 0
for i in range(len(all_as)):
    if 'Next' in all_as[i].text:
        # the link just before "Next" holds the last page number
        total = int(all_as[i-1].text)
        break
Now just loop through the range
for count in range(1, total + 1):
    ins.get('https://services.wiltshire.gov.uk/PlanningGIS/LLPG/WeeklyList/10702380,{}'.format(count))
The count increases with each iteration; for each page, get the source code and then extract the data from it.
Note: Don't forget to sleep when going from one page to another.
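The search for the link before 'Next' can be tried out without a browser. Below is a sketch that works on the link texts only. `page_count` is a hypothetical helper, and the sample list mirrors the pagination HTML in the question, where the current page is a `<span>` and not a link. Note the conversion to `int`, since link text is a string and `range()` needs a number.

```python
def page_count(link_texts):
    """Return the page number shown just before the 'Next' link, or 0 if absent."""
    for i, text in enumerate(link_texts):
        if "Next" in text and i > 0:
            previous = link_texts[i - 1].strip()
            # Guard against a non-numeric neighbour such as 'Prev'.
            return int(previous) if previous.isdigit() else 0
    return 0


print(page_count(["First", "Prev", "1", "3", "4", "Next", "Last"]))  # prints 4
```

With BeautifulSoup, the input list could be built as `[a.text for a in all_as]`.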

Selenium/Python - Lag In Code?

I wrote some code to extract Insider Trading information from the Toronto Stock Exchange. Using Selenium, I open this link and then, using a list of stocks, input each one into the form, retrieve the data, put it into another list, and then do the same for the next stock.
Here is the code:
from selenium import webdriver
stocks = ['RKN','MG','GTE','IMO','REI.UN','RY']
dt = []
url = 'https://app.tmxmoney.com/research/insidertradesummaries?locale=EN'
driver = webdriver.Firefox()
driver.get(url)
search = driver.find_element_by_xpath('//ul[@class="nav nav-pills"]/li[3]')
search.click()
stock_form = driver.find_element_by_name('text')
for stock in stocks:
    stock_form.clear()
    stock_form.send_keys(stock)
    stock_form.submit()
    data = driver.find_element_by_xpath('//div[@class="insider-trades-symbol-search-container"]/div[@class="ng-binding"]')
    a = data.text.split('\n')
    if len(a) > 1:
        dt.append(a[-1].split())
    else:
        dt.append([])
driver.close()
If you run the code, you can see each stock being input into the form and the data popping up, which I then attempt to retrieve. However, when I get the text from "data", it's as if it was retrieved from what was visible on the page before the form was submitted. I tried adding a wait to the code, to no avail.
I added a time.sleep(1) and the code works as intended.
from selenium import webdriver
import time
stocks = ['RKN','MG','GTE','IMO','REI.UN','RY']
dt = []
url = 'https://app.tmxmoney.com/research/insidertradesummaries?locale=EN'
driver = webdriver.Firefox()
driver.get(url)
search = driver.find_element_by_xpath('//ul[@class="nav nav-pills"]/li[3]')
search.click()
stock_form = driver.find_element_by_name('text')
for stock in stocks:
    stock_form.clear()
    stock_form.send_keys(stock)
    stock_form.submit()
    # give the page a moment to update before reading the result
    time.sleep(1)
    data = driver.find_element_by_xpath('//div[@class="insider-trades-symbol-search-container"]/div[@class="ng-binding"]')
    a = data.text.split('\n')
    if len(a) > 1:
        dt.append(a[-1].split())
    else:
        dt.append([])
driver.close()
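A fixed `time.sleep(1)` works only while the page updates within a second. A more robust pattern is to poll until the text differs from what was shown before the form was submitted; Selenium's `WebDriverWait(...).until(...)` polls in a similar way. Below is a minimal sketch in plain Python, where `wait_for_change`, `read_text` and the timeout values are hypothetical:

```python
import time


def wait_for_change(read_text, previous, timeout=10.0, poll=0.2):
    """Call read_text() until it returns something other than `previous`.

    Returns the new text, or raises TimeoutError after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        current = read_text()
        if current != previous:
            return current
        if time.monotonic() >= deadline:
            raise TimeoutError("text did not change within %.1f s" % timeout)
        time.sleep(poll)
```

In the loop, `read_text` could be a lambda that locates the result div and returns its `.text`, with `previous` read just before `stock_form.submit()`. If two consecutive stocks produce identical text, the call times out, so that case needs separate handling.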
