I am using Google Maps to get the GPS coordinates of an address that I am searching for. I want to get the URL after I click Submit so that I can extract the GPS coordinates from it. However, my URL only shows: https://www.google.com/maps
url = "http://maps.google.com/"
locationAdrs = '957 ASHBY GROVE SW ATLANTA'
browser = webdriver.Chrome(executable_path="C:/Users/joe/AppData/Local/Programs/Python/Python37-32/PyOn/chromedriver")
browser.get(url)
address = browser.find_element_by_xpath('//*[#id="searchboxinput"]')
address.send_keys(locationAdrs)
address.submit()
url = browser.current_url
print(url)
You have to re-read the URL that was actually loaded, because the address you navigate to may not be the one the browser finally lands on. You then have to wait for the browser to update before you read the new address:
url = "https://www.google.com/maps"
locationAdrs = '957 ASHBY GROVE SW ATLANTA'
browser.get(url)
address = browser.find_element_by_xpath('//*[#id="searchboxinput"]')
address.send_keys(locationAdrs)
# address.submit() - doesn't seem to do the right thing.
url = browser.current_url # have initial url on same format before click is made to move away
browser.find_element_by_xpath('//*[#id="searchbox-searchbutton"]').click()
while url == browser.current_url:
time.sleep(2)
url = browser.current_url
print(url)
Output:
https://www.google.com/maps/place/957+Ashby+Grove+SW,+Atlanta,+GA+30314,+USA/@33.7500669,-84.4211224,17z/data=!3m1!4b1!4m5!3m4!1s0x88f5035d3de5336f:0x9ca82913b5ecbde!8m2!3d33.7500669!4d-84.4189284
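From there, one hedged way to pull the coordinates out of the URL is a regular expression over the @lat,lng segment (a sketch based on the output above; the pair after @ is the map centre, while the !3d…!4d… pair in the data blob appears to be the pin itself):

import re

url = ("https://www.google.com/maps/place/957+Ashby+Grove+SW,+Atlanta,"
       "+GA+30314,+USA/@33.7500669,-84.4211224,17z")
match = re.search(r'@(-?\d+\.\d+),(-?\d+\.\d+)', url)  # lat,lng after the '@'
if match:
    lat, lng = float(match.group(1)), float(match.group(2))
    print(lat, lng)  # -> 33.7500669 -84.4211224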
I'm trying to get the entire page content from this page: https://www.csaregistries.ca/GHG_VR_Listing/CleanProjectProjects
It seems that the page automatically limits the displayed content while scrolling.
For instance, the 1st code returns names starting with A - C, whereas the 2nd code returns L - W. Is there any way to pull out the entire page content? (I'm not asking for page_source, scrolling down, or time.sleep as an answer; I'm asking how to override the page's unknown limitation with Selenium.)
The code used is as below:
1st (w/o going to the page bottom):
url = "https://www.csaregistries.ca/GHG_VR_Listing/CleanProjectProjects"
browser = webdriver.Chrome(ChromeDriverManager().install())
browser.maximize_window()
browser.get(url)
time.sleep(10)
content = browser.page_source.encode('utf-8')
file_ = open('result1.html', 'wb')
file_.write(content)
file_.close()
browser.close()
2nd (w/ going to the page bottom):
url = "https://www.csaregistries.ca/GHG_VR_Listing/CleanProjectProjects"
browser = webdriver.Chrome(ChromeDriverManager().install())
browser.maximize_window()
browser.get(url)
time.sleep(10)
elem = browser.find_element_by_xpath(
'/html/body/div/div/div/div/div/div/div[2]/footer/div/div/div/div[3]')
actions = ActionChains(browser)
actions.click(elem).perform()
time.sleep(5)
content = browser.page_source.encode('utf-8')
file_ = open('result1.html', 'wb')
file_.write(content)
file_.close()
browser.close()
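Since the symptoms suggest the page only renders the rows currently in view, one hedged workaround (different from a single jump to the bottom) is to scroll in steps and accumulate every row seen at each position. A minimal sketch; the plain "tr" selector is an assumption about the markup, not taken from the page:

from bs4 import BeautifulSoup
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import time

browser = webdriver.Chrome(ChromeDriverManager().install())
browser.get("https://www.csaregistries.ca/GHG_VR_Listing/CleanProjectProjects")
time.sleep(10)
seen = {}  # every row ever rendered, keyed by its text
height = browser.execute_script("return document.body.scrollHeight")
for y in range(0, height, 500):  # scroll in steps so each band of rows gets rendered
    browser.execute_script("window.scrollTo(0, arguments[0]);", y)
    time.sleep(1)
    soup = BeautifulSoup(browser.page_source, "html.parser")
    for row in soup.select("tr"):  # assumed selector
        seen[row.get_text(" ", strip=True)] = str(row)
print(len(seen), "unique rows collected")
browser.quit()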
I am trying to search through multiple pages via an API GET request, but I am unsure how to search past the first page of results.
The first GET request downloads the first page of results, but to get to the next page you need to GET a new URL, which is listed at the bottom of the first page.
import requests

def getaccount_data():
    view_code = input("Enter Account: ")
    global token
    header = {"X-Fid-Identity": token}
    url = BASE_URL + '/data/v1/entity/account'
    accountdata = requests.get(url, headers=header, verify=False)
    newaccountdata = accountdata.json()
    for data in newaccountdata['results']:
        if data['fields']['view_code'] == view_code:
            print("Account is set up")
        else:
            url = newaccountdata['moreUri']
            print(url)

getaccount_data()
Is there any way to search all the pages by updating the URL to get to the next page?
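If moreUri behaves the way the question describes, a minimal sketch of that loop (reusing your token and BASE_URL globals, and assuming moreUri is a complete URL that is absent on the last page) could look like this:

import requests

def getaccount_data():
    view_code = input("Enter Account: ")
    header = {"X-Fid-Identity": token}
    url = BASE_URL + '/data/v1/entity/account'
    while url:  # follow 'moreUri' page by page until the API stops returning one
        page = requests.get(url, headers=header, verify=False).json()
        for data in page['results']:
            if data['fields']['view_code'] == view_code:
                print("Account is set up")
                return
        url = page.get('moreUri')  # assumption: missing/None on the last page
    print("Account not found")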
I want to request this url:
https://www.codal.ir/CompanyList.aspx
This URL contains tables spread over 110 pages; when the page is changed, the URL stays the same and no new request appears.
This is my code:
import requests

session = requests.Session()
isics = session.get("https://www.codal.ir/CompanyList.aspx")
print(isics.text)
but I only get the first page's information. I intend to extract the required information from the tables with requests and regex, but if you have another way I will be happy to hear it. Thanks for helping me get all the pages.
I used Selenium to navigate the table. You can't do that with requests, because there are no links that take you to the next page of the table. You can find the code below.
from bs4 import BeautifulSoup
from selenium import webdriver
import time

def get_company_links(links, driver):
    soup = BeautifulSoup(driver.page_source, "html.parser")
    rows = soup.select("table.companies-table tr")
    for row in rows:
        link = row.select_one("a")
        if link:
            links.append("https://www.codal.ir/" + link['href'])

links = []
options = webdriver.ChromeOptions()
# options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
driver.get("https://www.codal.ir/CompanyList.aspx")
current_page_button = driver.find_element_by_css_selector('input[type="submit"].normal.selected')
page_number = int(current_page_button.get_attribute('value'))
while True:
    get_company_links(links, driver)
    next_page_button = driver.find_element_by_css_selector('input#ctl00_ContentPlaceHolder1_ucPager1_btnNext')
    next_page_button.click()
    time.sleep(2)
    previous_page_number = page_number
    current_page_button = driver.find_element_by_css_selector('input[type="submit"].normal.selected')
    page_number = int(current_page_button.get_attribute('value'))
    if previous_page_number == page_number:
        break  # no more pages left

print(links)
The main working principle is navigating through the table and collecting the links of the company pages. We use the next button to navigate, and stop when the last page index equals the current index, which indicates that we have arrived at the end of the table.
import time
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}  # placeholder; the real headers were defined elsewhere in my script

shoe = input('Shoe name: ')
URL = 'https://stockx.com/search?s=' + shoe
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
time.sleep(2)  # this was to ensure the webpage had enough time to load so that it wouldn't try to scrape a prematurely loaded website
test = soup.find(class_='BrowseSearchDescription__SearchConfirmation-sc-1mt8qyd-1 dcjzxm')
print(test)  # returns None
print(URL)  # prints the URL (which is the correct URL of the website I'm attempting to scrape)
I understand that I could easily do this with Selenium; however, it is very inefficient, as it loads a Chrome tab and navigates to the web page. I'm trying to make this efficient, and my original "prototype" did use Selenium, but it was always detected as a bot and my whole code was stopped by captchas. Am I doing something wrong that is causing the code to return None, or is that specific webpage unscrapeable? If you need it, the specific URL is https://stockx.com/search?s=yeezy
I tried your code and here is the result.
Code
import requests
import bs4 as bs

shoe = 'yeezy'
URL = 'https://stockx.com/search?s=' + shoe
page = requests.get(URL)
soup = bs.BeautifulSoup(page.content, 'html.parser')
And when I see what's inside the soup, here is the result.
Result
..
..
<div id="px-captcha">
</div>
<p> Access to this page has been denied because
we believe you are using automation tools to browse the website.</p>
..
..
Yes, I guess the developers didn't want the website to be scraped.
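For what it's worth, a hedged way to detect that block programmatically, based on the px-captcha div shown above:

import requests
from bs4 import BeautifulSoup

page = requests.get('https://stockx.com/search?s=yeezy')
soup = BeautifulSoup(page.content, 'html.parser')
if soup.find(id='px-captcha'):  # the block page seen above carries this div
    print('Blocked by bot detection, HTTP status:', page.status_code)
else:
    print('Got the real page')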
I am learning Python scraping techniques, but I am stuck on the problem of scraping an Ajax page like this one.
I want to scrape all the medicine names and details that appear on the page. I have read most of the answers on Stack Overflow, but I am not getting the right data after scraping. I also tried scraping with Selenium and sending a forged POST request, but both failed.
So please help me with this Ajax scraping topic, especially for this page, because the Ajax is triggered by selecting an option from the dropdown.
Also, please provide me with some resources on scraping Ajax pages.
# using selenium
from selenium import webdriver
import bs4 as bs
import lxml
import requests

path_to_chrome = '/home/brutal/Desktop/chromedriver'
browser = webdriver.Chrome(executable_path=path_to_chrome)
url = 'https://www.gianteagle.com/Pharmacy/Savings/4-10-Dollar-Drug-Program/Generic-Drug-Program/'
browser.get(url)
browser.find_element_by_xpath('//*[@id="ctl00_RegionPage_RegionPageMainContent_RegionPageContent_userControl_StateList"]/option[contains(text(), "Ohio")]').click()
new_url = browser.current_url
r = requests.get(new_url)
print(r.content)
You can download ChromeDriver here.
normalize-space is used in order to remove trash from the web text, such as stray whitespace characters.
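As a tiny hedged demo of what normalize-space does (a made-up fragment, not the real page markup):

from lxml.html import fromstring

# normalize-space trims the ends and collapses internal runs of whitespace
cell = fromstring('<span>  10\n   mg  </span>')
print(cell.xpath('normalize-space(.)'))  # -> '10 mg'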
from time import sleep
from selenium import webdriver
from lxml.html import fromstring

data = {}
driver = webdriver.Chrome('PATH TO YOUR DRIVER/chromedriver')  # i.e. '/home/superman/www/myproject/chromedriver'
driver.get('https://www.gianteagle.com/Pharmacy/Savings/4-10-Dollar-Drug-Program/Generic-Drug-Program/')

# Loop states
for i in range(2, 7):
    dropdown_state = driver.find_element(by='id', value='ctl00_RegionPage_RegionPageMainContent_RegionPageContent_userControl_StateList')
    # open dropdown
    dropdown_state.click()
    # click state
    driver.find_element_by_xpath('//*[@id="ctl00_RegionPage_RegionPageMainContent_RegionPageContent_userControl_StateList"]/option['+str(i)+']').click()
    # let the page download
    sleep(3)
    # prepare HTML
    page_content = driver.page_source
    tree = fromstring(page_content)
    state = tree.xpath('//*[@id="ctl00_RegionPage_RegionPageMainContent_RegionPageContent_userControl_StateList"]/option['+str(i)+']/text()')[0]
    data[state] = []
    # Loop products inside the state
    for line in tree.xpath('//*[@id="ctl00_RegionPage_RegionPageMainContent_RegionPageContent_userControl_gridSearchResults"]/tbody/tr[@style]'):
        med_type = line.xpath('normalize-space(.//td[@class="medication-type"])')
        generic_name = line.xpath('normalize-space(.//td[@class="generic-name"])')
        brand_name = line.xpath('normalize-space(.//td[@class="brand-name hidden-xs"])')
        strength = line.xpath('normalize-space(.//td[@class="strength"])')
        form = line.xpath('normalize-space(.//td[@class="form"])')
        qty_30_day = line.xpath('normalize-space(.//td[@class="30-qty"])')
        price_30_day = line.xpath('normalize-space(.//td[@class="30-price"])')
        qty_90_day = line.xpath('normalize-space(.//td[@class="90-qty hidden-xs"])')
        price_90_day = line.xpath('normalize-space(.//td[@class="90-price hidden-xs"])')
        data[state].append(dict(med_type=med_type,
                                generic_name=generic_name,
                                brand_name=brand_name,
                                strength=strength,
                                form=form,
                                qty_30_day=qty_30_day,
                                price_30_day=price_30_day,
                                qty_90_day=qty_90_day,
                                price_90_day=price_90_day))

print('data:', data)
driver.quit()