I am very new to web scraping with Python. On the web page I am trying to scrape, I can enter the string 'ABC' in the text box and click search. This gives me the details of 'ABC', but under the same URL; there is no change in the URL. I am trying to scrape the resulting details.
I have gotten as far as clicking "search", but I do not know how to capture the results of the search (the details for the search string 'ABC'). Please suggest how I could achieve this.
from selenium import webdriver
import webbrowser
new = 2 # open in a new tab, if possible
path_to_chromedriver = 'C:/Tech-stuffs/chromedriver/chromedriver.exe' # change path as needed
browser = webdriver.Chrome(executable_path = path_to_chromedriver)
url = 'https://www.federalreserve.gov/apps/mdrm/data-dictionary'
browser.get(url)
browser.find_element_by_xpath('//*[@id="form0"]/table/tbody/tr[2]/td/label[2]').click()
browser.find_element_by_xpath("//select[@id='SelectedReportForm']/option[@value='1']").click()
browser.find_element_by_xpath('//*[@id="Search"]').click()
Use find_elements_by_xpath() with an XPath that matches all of the search results, then iterate through them with a for loop and print each result's text. That should, at the bare minimum, get you what you want.
results = browser.find_elements_by_xpath('//table//tr')
for result in results:
    print("%s\n" % result.text)
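If you'd rather keep the results than just print them, the same loop can be wrapped in a small helper (the function name is mine; it works on anything with a .text attribute, which Selenium WebElements have):

```python
def rows_to_text(rows):
    """Collect the visible text of each result row into a list."""
    return [row.text for row in rows]
```

With the browser from above this would be rows_to_text(browser.find_elements_by_xpath('//table//tr')).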
I'm trying to make searching for temporary apartments a bit easier on myself, but a website with listings for these apartments requires me to select a suggestion from their drop-down list before I can click on submit, no matter how complete the entry in the search box might be.
The hope is that I can get through to the search results and then extract contact information from each listing. I was able to extract the data I need from a single listing using Beautiful Soup and Requests, but I had to paste the URL for that specific listing into my code; I didn't get any further than that. If anyone has a suggestion on how to circumvent the landing page and get to the relevant listings, please let me know.
I tried just splicing the town name and the state name into the address bar, mimicking how the URL looks after a successful search, but that didn't work.
The site is Mein Monteurzimmer.
Here is my code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.select import Select
driver = webdriver.Firefox()
webpage = r"https://mein-monteurzimmer.de"
print('Prosim vnesi zeljeno mesto') #Please enter the town to search
searchterm = input()
driver.get(webpage)
sbox = driver.find_element_by_xpath("/html/body/main/cpagearea/section/div[2]/div/section[1]/div/div[1]/section/form/div/input")
sbox.send_keys(searchterm)
ddown = driver.find_element_by_xpath("/html/body/main/cpagearea/section/div[2]/div/section[1]/div/div[1]/section/form/div")
ddown.select_by_value(1)
webdriver.wait(2)
#select = driver.find_element_by_xpath("/html/body/main/cpagearea/section/div[2]/div/section[1]/div/div[1]/section/form/div")
submit = driver.find_element_by_xpath("/html/body/main/cpagearea/section/div[2]/div/section[1]/div/div[1]/section/form/button")
submit.click
When I inspect the search box I can't find anything related to the suggestions until I enter a text. Then I can't click on the HTML code because that dismisses the suggestions. It's quite frustrating.
So I'm blindly trying to select something.
The error here is:
AttributeError: 'FirefoxWebElement' object has no attribute 'select_by_value'
I tried something with Select, but that doesn't work the way I tried it.
I am stumped, and the solutions I could find were specific to other sites like Google or Amazon, so I couldn't make sense of it.
Does anyone know how I could make this work?
Here's the code for getting information out of a listing, which I'll have to expand on to get the other data:
import bs4, requests

def getMonteurAddress(MonteurUrl):
    res = requests.get(MonteurUrl)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    elems = soup.select('section.c:nth-child(4) > div:nth-child(2) > div:nth-child(2) > dl:nth-child(1) > dd:nth-child(2)')
    return elems[0].text.strip()
address = getMonteurAddress('https://mein-monteurzimmer.de/105742/monteurzimmer/deggendorf-monteurzimmer-deggendorf-pensionfelix%40googlemailcom')
print('Naslov je ' + address) #print call to see if it gets the right data
As you can see, once you type something in, a list of divs is created. Now you need a valid locator for these divs. To get one, inspect the elements in debug pause mode (F12 --> Sources tab --> F8).
Try the code below to select the first address matching what you typed.
sbox = driver.find_element_by_xpath("//input[@placeholder='Adresse, PLZ oder Ort eingeben']")
sbox.send_keys(searchterm)
addessXpath = "//div[contains(text(),'"+searchterm+"')]"
driver.find_element_by_xpath(addessXpath).click()
Note: if more than one address matches, the first one will be selected.
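Since that XPath is plain string concatenation, it can be factored into a small helper (the name suggestion_xpath is mine; note this would break if the search term itself contained a single quote):

```python
def suggestion_xpath(searchterm):
    """Build an XPath matching a suggestion div that contains the typed text."""
    return "//div[contains(text(),'" + searchterm + "')]"

print(suggestion_xpath("Deggendorf"))  # //div[contains(text(),'Deggendorf')]
```

The driver call then becomes driver.find_element_by_xpath(suggestion_xpath(searchterm)).click().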
I am attempting to web-scrape info off of the following website: https://www.axial.net/forum/companies/united-states-family-offices/
I am trying to scrape the description for each family office, so "https://www.axial.net/forum/companies/united-states-family-offices/" + insert_company_name are the pages I need to scrape.
So I wrote the following code to test the program for just one page:
from bs4 import BeautifulSoup as soup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Chrome('insert_path_here/chromedriver')
driver.get("https://network.axial.net/company/ansaco-llp")
page_source = driver.page_source
soup2 = soup(page_source,"html.parser")
soup2.findAll('axl-teaser-description')[0].text
This works for the single page, as long as the description doesn't have a "show full description" drop down button. I will save that for another question.
I wrote the following loop:
#Note: Lst2 has all the names for the companies. I made sure they match the webpage
lst3 = []
for key in lst2[1:]:
    driver.get("https://network.axial.net/company/" + key.lower())
    page_source = driver.page_source
    for handle in driver.window_handles:
        driver.switch_to.window(handle)
    word_soup = soup(page_source, "html.parser")
    if word_soup.findAll('axl-teaser-description') == []:
        lst3.append('null')
    else:
        c = word_soup.findAll('axl-teaser-description')[0].text
        lst3.append(c)
print(lst3)
When I run the loop, all of the values come out as "null", even the ones without "click for full description" buttons.
I edited the loop to instead print out "word_soup", and the page is different than if I had run it without a loop and does not have the description text.
I don't understand why a loop would cause that but apparently it does. Does anyone know how to fix this problem?
Found the solution: pause the program for 3 seconds after driver.get:
import time

lst3 = []
for key in lst2[1:]:
    driver.get("https://network.axial.net/company/" + key.lower())
    time.sleep(3)
    page_source = driver.page_source
    word_soup = soup(page_source, "html.parser")
    if word_soup.findAll('axl-teaser-description') == []:
        lst3.append('null')
    else:
        c = word_soup.findAll('axl-teaser-description')[0].text
        lst3.append(c)
print(lst3)
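The empty-findAll check in that loop can be seen in isolation with a static snippet; this sketch (the HTML strings are made up for illustration) shows why a page source grabbed before the JavaScript has run yields an empty list:

```python
import bs4

rendered = "<div><axl-teaser-description>Family office text</axl-teaser-description></div>"
unrendered = "<div></div>"  # roughly what page_source looks like before the JS has run

soup_ok = bs4.BeautifulSoup(rendered, "html.parser")
soup_early = bs4.BeautifulSoup(unrendered, "html.parser")

print(soup_ok.findAll("axl-teaser-description")[0].text)  # the description text
print(soup_early.findAll("axl-teaser-description"))       # [] -> loop appends 'null'
```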
I see that the page uses javascript to generate the text, meaning it doesn't show up in the page source, which is weird but ok. I don't quite understand why you're iterating through and switching to all the browser windows you have open, but you definitely won't find the description in the page source / beautifulsoup.
Honestly, I'd personally look for a better website if you can, otherwise, you'll have to try it with selenium which is inefficient and horrible.
I am new to Selenium/Firefox. My goal is to go to my URL, fill in basic input, select a few items, let the browser change the content, and download a PDF from there. Ideally, I would love to do it repeatedly later by looping over a number of new items. As a first step, I managed to get the browser to work and change content once. But I am stuck on getting the content out, as find_elements_by_tag_name() seems to get me something funny rather than a usual HTML tag like what Beautifulsoup's .find_all() would give. I appreciate very much any help here.
Here is my code:
from selenium import webdriver
from selenium.webdriver.support.ui import Select
url ='http://www.hkexnews.hk/listedco/listconews/advancedsearch/search_active_main.aspx'
browser = webdriver.Firefox(executable_path = r'C:\Program Files\Mozilla Firefox\geckodriver.exe')
browser.get(url)
StockElem = browser.find_element_by_id('ctl00_txt_stock_code')
StockElem.send_keys('00772')
StockElem.click()
select = Select(browser.find_element_by_id('ctl00_sel_tier_1'))
select.select_by_value('3')
select = Select(browser.find_element_by_id('ctl00_sel_tier_2'))
select.select_by_value('153')
select = Select(browser.find_element_by_id('ctl00_sel_DateOfReleaseFrom_d'))
select.select_by_value('01')
select = Select(browser.find_element_by_id('ctl00_sel_DateOfReleaseFrom_m'))
select.select_by_value('01')
select = Select(browser.find_element_by_id('ctl00_sel_DateOfReleaseFrom_y'))
select.select_by_value('2000')
# select the search button
browser.execute_script("document.forms[0].submit()")
element = browser.find_elements_by_tag_name("a")
print(element)
After clicking the Search button, you have 5 links to download PDF files.
You should find those links by the CSS selector: .news.
Then go through the list of links by index and click each link to download:
elements[0].click() -- clicking the first link.
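Put together, the click-by-index idea might look like this sketch (click_links_by_index is my own naming; the list is re-queried before every click because navigating or downloading can leave earlier element references stale):

```python
def click_links_by_index(driver, selector=".news"):
    """Click each link matching the CSS selector in turn, re-finding the list each pass."""
    count = len(driver.find_elements_by_css_selector(selector))
    for i in range(count):
        links = driver.find_elements_by_css_selector(selector)
        links[i].click()
    return count
```

With the browser above this would be click_links_by_index(browser); whether each click downloads the PDF or opens it in a tab depends on the browser's PDF handling settings.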
I am using selenium to navigate to a webpage and store the page source in a variable.
from selenium import webdriver
driver = webdriver.PhantomJS()
driver.get("http://google.com")
html1 = driver.page_source
html1 now contains the page source of http://google.com.
My question is: how can I return HTML attributes such as id="id" or name="name"?
EDIT:
For example:
The webpage I navigated to with selenium has a menu bar with 4 tabs. Each tab has an id element; id="tab1", id="tab2", and so on. I would like to return each id value. So I want tab1, tab2, so on.
Edit#2:
Another example:
The homepage of my website (http://chrisarroyo.me) has several clickable links with ids. I would like to be able to return/print those ids to my console.
So I would like to return the ids for the Learn More button and the ids for the links in the footer (facebookLnk, githubLnk, etc..)
If you are looking for a list of WebElements that have an ID use:
elements = driver.find_elements_by_xpath("//*[@id]")
You can then iterate over that list and use get_attribute("id") to pull out each element's specific ID.
For name, it's pretty much the same code. Except change id to name and you're set.
Thank you @stewartm, your comment helped.
This ended up giving me the results I was looking for:
from selenium import webdriver
driver = webdriver.PhantomJS()
driver.get("http://chrisarroyo.me")
id_elements = driver.find_elements_by_xpath("//*[@id]")
for eachElement in id_elements:
    individual_ids = eachElement.get_attribute("id")
    print(individual_ids)
After running the above ^^ the output listed each of the ids on the webpage specified.
output:
navbarNavAltMarkup
learnBtn
githubLnk
facebookLnk
linkedinLnk
I've written a script using Python in combination with Selenium to parse a table from a target page, which can be reached by following the steps I've tried to describe below for clarity. The script does reach the destination, but at the time of scraping data from that table it throws an error in the console: "Unable to locate element". I tried an online xpath tester to see if my xpath is wrong, but found that the xpath I've used in my script for "td_data" is right. I suppose what I'm missing here is beyond my knowledge. I hope there is somebody who can take a look into it and provide me with a workaround.
Btw, the site link is given in my script.
Link to see the html contents for the table: "https://www.dropbox.com/s/kaom5qzk78xndqn/Partial%20Html%20content%20for%20the%20table.txt?dl=0"
Steps to reach the target page which my script is able to maintain:
Selecting "I've read and understand above"
Putting this keyword "pump" in the inputbox located right below "Select medical devices".
Selecting the checkbox "Devices found for "pump".
Finally, pressing the search button
Script I've tried with so far:
from selenium import webdriver
import time
driver = webdriver.Chrome()
driver.get('http://apps.tga.gov.au/Prod/devices/daen-entry.aspx')
driver.find_element_by_id('disclaimer-accept').click()
time.sleep(5)
driver.find_element_by_id('medicine-name').send_keys('pump')
time.sleep(8)
driver.find_element_by_id('medicines-header-text').click()
driver.find_element_by_id('submit-button').click()
time.sleep(7)
for item in driver.find_elements_by_xpath('//div[@class="table-responsive"]'):
    for tr_data in item.find_elements_by_xpath('.//tr'):
        td_data = tr_data.find_element_by_xpath('.//span[@class="hovertext"]//a')
        print(td_data.text)

driver.close()
Why don't you just do this:
from selenium import webdriver
import time
driver = webdriver.Chrome()
driver.get('http://apps.tga.gov.au/Prod/devices/daen-entry.aspx')
driver.find_element_by_id('disclaimer-accept').click()
time.sleep(5)
driver.find_element_by_id('medicine-name').send_keys('pump')
time.sleep(8)
driver.find_element_by_id('medicines-header-text').click()
driver.find_element_by_id('submit-button').click()
time.sleep(7)
for item in driver.find_elements_by_xpath(
        '//table[@id]/tbody/tr/td[@class]/span[@class]/a[@id]'):
    print(item.text)
driver.close()
Output:
27233
27283
27288
27289
27390
27413
27441
27520
25445
27816
27866
27970
28033
28238
26999
28264
28407
28448
28437
28509
28524
28553
28647
28677
28646
Maybe you want to think about saving the page with driver.page_source, pulling out the table, and saving it as an HTML file. Then use pandas' read_html to load the table into a dataframe.
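That pandas route might look like this sketch (the HTML string here is a stand-in for the table pulled out of driver.page_source; read_html returns one DataFrame per table found, hence the list):

```python
import io

import pandas as pd

# Stand-in for the table extracted from driver.page_source.
html = """
<table>
  <tr><th>Report number</th></tr>
  <tr><td>27233</td></tr>
  <tr><td>27283</td></tr>
</table>
"""

tables = pd.read_html(io.StringIO(html))  # list with one DataFrame per <table>
df = tables[0]
print(df["Report number"].tolist())  # [27233, 27283]
```

Note that read_html needs an HTML parser backend installed (lxml, or html5lib plus BeautifulSoup).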