So I'm trying to scrape census data from a website that changes dynamically when a county is selected from the drop down menu. It looks like this:
<select id="cat_id_select_GEO" onchange="changeHeaderSelection('GEO');">
<option value="0500000US01001" selected="selected">Autauga County, Alabama</option>
</select>
From the research I've done, it sounds like I need to make some sort of GET request (or use Selenium?), but I am completely lost on how to do this. I know how to get the data I want once I've made the county selection, but I've never had to scrape a site that changes dynamically (i.e. the URL doesn't change).
I understand that some may find this to be a simple question, but I've read numerous similar questions and would greatly benefit from someone walking me through an example and/or directing me to a solid guide.
This is what I've been experimenting with so far. It does seem to select the values, but it raises this error:
Message: stale element reference: element is not attached to the page document
(Session info: chrome=74.0.3729.169)
for index, row in StateURLs.iterrows():
    url = row['URL']
    state = row['STATE']
    driver = webdriver.Chrome(executable_path=r'C:\chromedriver.exe')
    driver.get(url)
    select_county = Select(driver.find_element_by_id('cat_id_select_GEO'))
    options = select_county.options
    for index in range(0, len(options) - 1):
        select_county.select_by_index(index)
I would also love help converting each resulting page to BeautifulSoup so I can scrape it after the selection is made.
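A common cause of the stale element reference error above is that the page rebuilds the drop-down after each selection, so a reference obtained before the change no longer points at a live element. A minimal sketch of the usual workaround, looking the element up again on every iteration, follows. The helper name `scrape_each_option` and its three callables are hypothetical, introduced only so the loop can be shown independently of a browser; the trailing comments show how they might be wired to Selenium and BeautifulSoup, which is an assumption rather than something tested against this site.

```python
def scrape_each_option(locate_select, get_html, parse):
    """Select every option in turn and return one parsed result per option.

    locate_select: callable returning a fresh Select-like object
    get_html:      callable returning the current page source
    parse:         callable turning that source into whatever you need
    """
    results = []
    # Count the options once; the count is assumed not to change.
    option_count = len(locate_select().options)
    for index in range(option_count):
        # Look the element up again: the previous reference may be stale
        # because the page rebuilt the <select> after the last change.
        select = locate_select()
        select.select_by_index(index)
        results.append(parse(get_html()))
    return results


# Possible wiring with Selenium and BeautifulSoup (assumed, not run here):
#   from bs4 import BeautifulSoup
#   from selenium.webdriver.common.by import By
#   from selenium.webdriver.support.ui import Select
#   pages = scrape_each_option(
#       lambda: Select(driver.find_element(By.ID, "cat_id_select_GEO")),
#       lambda: driver.page_source,
#       lambda html: BeautifulSoup(html, "html.parser"),
#   )
```

If the table loads asynchronously after the selection, an explicit wait may also be needed before reading the page source.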
The main landing page makes GET requests with a query string. The response is a JSON string containing the information that is first returned when you submit your query, including the further URLs that are listed on the results page.
import requests
search_term = 'searchTerm: Autauga County, Alabama'
search_term = search_term.replace(' ','+')
r = requests.get('https://factfinder.census.gov/rest/communityFactsNav/nav?N=0&_t=1558559559868&log=t&searchTerm=term ' + search_term + ',Alabama&src=').json()
Here is an example of that json
I can generate the correct URL by hand, and it returns all that data as JSON in the browser, but I can't seem to configure requests so that it works. Perhaps someone else can pick this up and work it out. I will look again tomorrow.
r = requests.get('https://factfinder.census.gov/rest/communityFactsNav/nav?N=0&_t=1558559559868&log=t&searchTerm=term ' + search_term + ',Alabama&src=', allow_redirects= True).json()
url = 'https://factfinder.census.gov' + r['CFMetaData']['measuresAndLinks']['links']['2017 American Community Survey'][0]['url']
code = url.split('/')[-2]
url = 'https://factfinder.census.gov/tablerestful/tableServices/renderProductData?renderForMap=f&renderForChart=f&pid=ACS_17_5YR_{}&prodToReplace=ACS_16_5YR_{}&log=t&_ts=576607332612'.format(code, code)
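One thing that may be worth ruling out is the hand-built query string, since spaces and commas in the search term need to be percent-encoded. The sketch below builds the nav URL with the standard library so the encoding is handled consistently. The function name `build_nav_url` is hypothetical, the parameter names are copied from the URL in the answer, and whether the endpoint accepts the result has not been verified.

```python
from urllib.parse import urlencode

NAV_ENDPOINT = "https://factfinder.census.gov/rest/communityFactsNav/nav"


def build_nav_url(search_term, timestamp="1558559559868"):
    """Return the nav URL with every query value percent-encoded."""
    params = {
        "N": "0",
        "_t": timestamp,      # timestamp value copied from the answer's URL
        "log": "t",
        "searchTerm": search_term,
        "src": "",
    }
    # urlencode turns spaces into '+' and commas into '%2C'.
    return NAV_ENDPOINT + "?" + urlencode(params)


print(build_nav_url("Autauga County, Alabama"))
```

Passing the same dictionary as `params=` to `requests.get(NAV_ENDPOINT, params=params)` applies equivalent encoding without building the string by hand.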
I've been building this scraper (with some massive help from users here) to get data on some companies' debt with the public sector, and I've been able to get to the site, input the desired search parameters and scrape the first 50 results (out of 300). The problem I've encountered is that this page's pagination has the following characteristics:
It does not possess a next page button
The URL doesn't change with the pagination
The pagination is done with a Javascript script
Here's the code so far:
from selenium import webdriver

path_driver = "C:/Users/CS330584/Documents/Documentos de Defesa da Concorrência/Automatização de Processos/chromedriver.exe"
website = "https://sat.sef.sc.gov.br/tax.NET/Sat.Dva.Web/ConsultaPublicaDevedores.aspx"
value_search = "300"
final_table = []
driver = webdriver.Chrome(path_driver)
driver.get(website)
search_max = driver.find_element_by_id("Body_Main_Main_ctl00_txtTotalDevedores")
search_max.send_keys(value_search)
btn_consult = driver.find_element_by_id("Body_Main_Main_ctl00_btnBuscar")
btn_consult.click()
driver.implicitly_wait(10)
cnpjs = driver.find_elements_by_xpath("//*[@id='Body_Main_Main_grpDevedores_gridView']/tbody/tr/td[1]")
empresas = driver.find_elements_by_xpath("//*[@id='Body_Main_Main_grpDevedores_gridView']/tbody/tr/td[2]")
dividas = driver.find_elements_by_xpath("//*[@id='Body_Main_Main_grpDevedores_gridView']/tbody/tr/td[3]")
for i in range(len(empresas)):
    temp_data = {'CNPJ': cnpjs[i].text,
                 'Empresas': empresas[i].text,
                 'Divida': dividas[i].text
                 }
    final_table.append(temp_data)
How can I navigate through the pages in order to scrape their data? Thank you all for the help!
If you inspect the page and look at what happens when you click on the next page button, you'll see that the link's tag actually executes some JavaScript. It looks like this:
<font style="vertical-align: inherit;"><font style="vertical-align: inherit;">6</font></font>
But if you take that JavaScript call out of the href attribute (and replace each &quot; with a quotation mark), you'll see two function calls that look like this:
GridView_ScrollToTop("Body_Main_Main_grpDevedores_gridView");
__doPostBack('ctl00$ctl00$ctl00$Body$Main$Main$grpDevedores$gridView','Page$5');
Now I didn't take the time to analyze these functions in depth, but you don't really need to. You see the first call causes the browser to scroll to the top, and the second call actually causes the next page of data to load on the page. For your purposes, you only care about the second call.
You can experiment with this in the browser: perform your search and then, in the JS console, paste in the JS call, swapping in the number of the page you want to look at.
If you can do it via JS in the console on the webpage, you can do it with Selenium. You would do something like this to "click" each tab:
for i in range(1, 7):
    js = "__doPostBack('ctl00$ctl00$ctl00$Body$Main$Main$grpDevedores$gridView','Page$" + str(i) + "');"
    driver.execute_script(js)
    # do scraping stuff
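The string building in that loop can be pulled into a small helper so the scripts are easy to inspect before handing them to `driver.execute_script`. This is only a sketch: the names `GRID_TARGET`, `postback_script` and `postback_scripts` are hypothetical, and the target string is copied from the call shown above.

```python
GRID_TARGET = "ctl00$ctl00$ctl00$Body$Main$Main$grpDevedores$gridView"


def postback_script(page, target=GRID_TARGET):
    """Return the JavaScript call that asks the grid to show the given page."""
    return "__doPostBack('{}','Page${}');".format(target, page)


def postback_scripts(first_page, last_page, target=GRID_TARGET):
    """Return one script per page, from first_page to last_page inclusive."""
    return [postback_script(page, target)
            for page in range(first_page, last_page + 1)]


for script in postback_scripts(1, 6):
    print(script)
```

With 300 results at 50 per page, pages 1 to 6 should cover everything, which matches `range(1, 7)` in the loop above. After each `execute_script` call the table may take a moment to refresh, so waiting for it before scraping is probably necessary.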
I'm trying to extract the title of a product from amazon.com using the id of the span that contains the title.
This is what I wrote:
import requests
from bs4 import BeautifulSoup
url = 'https://www.amazon.com/Acer-SB220Q-Ultra-Thin-Frame-Monitor/dp/B07CVL2D2S/ref=lp_16225007011_1_7'
res = requests.get(url)
soup = BeautifulSoup(res.content, 'html.parser')
title = soup.find(id='productTitle').get_text()
print(title)
I keep getting either None or an empty list, or I can't extract anything and get an AttributeError saying that the object I used doesn't have an attribute get_text. That raised another question: how do I get the text of this simple span?
I would really appreciate it if someone could figure it out and help me.
Thanks in advance.
Problem
Running your code and checking the res value, you would get a 503 error. This means that the service is unavailable (HTTP status 503).
Solution
Following up on this SO post, it seems that adding headers={"User-Agent": "Defined"} to the GET request does work.
res = requests.get(url, headers={"User-Agent": "Defined"})
This returns a 200 (OK) response.
The Twist
Amazon actually checks for web scrapers, and even though you will get a page back, printing the result (print(soup)) will likely show you the following:
<body>
<!--
To discuss automated access to Amazon data please contact api-services-support@amazon.com.
For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.com/ref=rm_c_sv, or our Product Advertising API at https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html/ref=rm_c_ac for advertising use cases.
-->
...
<h4>Enter the characters you see below</h4>
<p class="a-last">Sorry, we just need to make sure you're not a robot. For best results, please make sure your browser is accepting cookies.</p>
</div>
</div>
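Because the status code is 200 either way, it can help to check the body for this page before trying to parse a title. A minimal sketch follows; the function name `looks_like_robot_check` is hypothetical, and the marker phrases are taken from the HTML shown above, so they are an assumption about what the page contains at any given time.

```python
# Phrases copied from the robot-check page quoted above.
ROBOT_CHECK_MARKERS = (
    "Enter the characters you see below",
    "make sure you're not a robot",
)


def looks_like_robot_check(html):
    """Return True if the HTML contains any known robot-check phrase."""
    return any(marker in html for marker in ROBOT_CHECK_MARKERS)


sample = "<h4>Enter the characters you see below</h4>"
print(looks_like_robot_check(sample))
```

Calling it on `res.text` right after the request makes the failure explicit instead of surfacing later as a missing `productTitle` element.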
The counter
But you can use selenium to simulate a human. A minimal working example for me was the following:
import selenium.webdriver
url = 'http://www.amazon.com/Acer-SB220Q-Ultra-Thin-Frame-Monitor/dp/B07CVL2D2S/ref=lp_16225007011_1_7'
driver = selenium.webdriver.Firefox()
driver.get(url)
title = driver.find_element_by_id('productTitle').text
print(title)
Which prints out
Acer SB220Q bi 21.5 Inches Full HD (1920 x 1080) IPS Ultra-Thin Zero Frame Monitor (HDMI & VGA Port), Black
One drawback of Selenium is that it is much slower than the requests library. It also opens a new browser window showing the page, but we can avoid that by using a headless driver.
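A headless run might look like the sketch below. It assumes Selenium 4 with Firefox and geckodriver installed; the function name `fetch_title_headless` is hypothetical, and there is no guarantee that a headless browser gets past the robot check.

```python
def fetch_title_headless(url, element_id="productTitle"):
    """Load url in headless Firefox and return the text of element_id."""
    # Imported here so that defining the function does not require Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("-headless")  # run Firefox without opening a window
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        return driver.find_element(By.ID, element_id).text
    finally:
        driver.quit()  # always close the browser, even if the lookup fails
```

It would be called with the product URL used earlier, for example `title = fetch_title_headless(url)`.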
I'm trying to make searching for temporary apartments a bit easier on myself, but a website with listings for these apartments requires me to select a suggestion from their drop down list before I can click on submit. No matter how complete the entry in the search box might be.
The ultimate hope here is that I can get forward to the search results and then extract contact information from each listing. I was able to extract the data I need from a listing using Beautiful soup and Requests, but I had to paste in the URL for that specific listing into my code. I didn't get that far. If anyone has a suggestion on how to perhaps circumvent the landing page to get to the relevant listings, please let me know.
I tried just splicing the town name and the state name into the address bar by looking at how it's written after a successful search but that didn't work.
The site is Mein Monteurzimmer.
Here is my code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.select import Select
driver = webdriver.Firefox()
webpage = r"https://mein-monteurzimmer.de"
print('Prosim vnesi zeljeno mesto') #Please enter the town to search
searchterm = input()
driver.get(webpage)
sbox = driver.find_element_by_xpath("/html/body/main/cpagearea/section/div[2]/div/section[1]/div/div[1]/section/form/div/input")
sbox.send_keys(searchterm)
ddown = driver.find_element_by_xpath("/html/body/main/cpagearea/section/div[2]/div/section[1]/div/div[1]/section/form/div")
ddown.select_by_value(1)
webdriver.wait(2)
#select = driver.find_element_by_xpath("/html/body/main/cpagearea/section/div[2]/div/section[1]/div/div[1]/section/form/div")
submit = driver.find_element_by_xpath("/html/body/main/cpagearea/section/div[2]/div/section[1]/div/div[1]/section/form/button")
submit.click
When I inspect the search box I can't find anything related to the suggestions until I enter a text. Then I can't click on the HTML code because that dismisses the suggestions. It's quite frustrating.
So I'm blindly trying to select something.
The error here is:
AttributeError: 'FirefoxWebElement' object has no attribute 'select_by_value'
I tried something with select, but that doesn't work with the way I tried this.
I am stumped, and the solutions I could find were specific to other sites like Google or Amazon, so I couldn't make sense of them.
Does anyone know how I could make this work?
Here's the code for getting information out of a listing, which I'll have to expand on to get the other data:
import bs4, requests

def getMonteurAddress(MonteurUrl):
    res = requests.get(MonteurUrl)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    elems = soup.select('section.c:nth-child(4) > div:nth-child(2) > div:nth-child(2) > dl:nth-child(1) > dd:nth-child(2)')
    return elems[0].text.strip()

address = getMonteurAddress('https://mein-monteurzimmer.de/105742/monteurzimmer/deggendorf-monteurzimmer-deggendorf-pensionfelix%40googlemailcom')
print('Naslov je ' + address) #print call to see if it gets the right data
As you can see, once you start typing, a list of divs is created. You need a valid locator for these divs. To find it, inspect the elements while the page is paused in the debugger (F12 --> Sources tab --> F8).
Try the code below to select the first address that matches what you typed.
sbox = driver.find_element_by_xpath("//input[@placeholder='Adresse, PLZ oder Ort eingeben']")
sbox.send_keys(searchterm)
addressXpath = "//div[contains(text(),'" + searchterm + "')]"
driver.find_element_by_xpath(addressXpath).click()
Note: If there is more than one matching address, the first one will be selected.
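Since the XPath is built by string concatenation, a search term containing a quote character would produce an invalid expression. A small sketch of one way to guard against that is below. The helper names `xpath_literal` and `suggestion_xpath` are hypothetical, and the `//div[contains(text(), ...)]` shape is simply carried over from the answer.

```python
def xpath_literal(value):
    """Return value as an XPath string literal, whatever quotes it contains."""
    if "'" not in value:
        return "'{}'".format(value)
    if '"' not in value:
        return '"{}"'.format(value)
    # Both quote types present: join the pieces with concat().
    parts = value.split("'")
    return "concat(" + ", \"'\", ".join("'{}'".format(p) for p in parts) + ")"


def suggestion_xpath(search_term):
    """Build the locator for a suggestion div containing search_term."""
    return "//div[contains(text(), {})]".format(xpath_literal(search_term))


print(suggestion_xpath("Deggendorf"))
```

The result can be passed to the same XPath lookup used in the answer. The suggestions appear only after typing, so an explicit wait before the click may also be needed.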
I am quite new to Python and am building a web scraper, which will scrape the following page and links in them: https://www.nalpcanada.com/Page.cfm?PageID=33
The problem is the page's default is to display the first 10 search results, however, I want to scrape all 150 search results (when 'All' is selected, there are 150 links).
I have tried messing around with the URL, but the URL remains static no matter what display results option is selected. I have also tried to look at the Network section of the Developer Tools on Chrome, but can't seem to figure out what to use to display all results.
Here is my code so far:
import bs4
import requests
import csv
import re

response = requests.get('https://www.nalpcanada.com/Page.cfm?PageID=33')
soup = bs4.BeautifulSoup(response.content, "html.parser")
urls = []
for a in soup.findAll('a', href=True, class_="employerProfileLink", text="Vancouver, British Columbia"):
    urls.append(a['href'])
pagesToCrawl = ['https://www.nalpcanada.com/' + url + '&QuestionTabID=47' for url in urls]
for pages in pagesToCrawl:
    html = requests.get(pages)
    soupObjs = bs4.BeautifulSoup(html.content, "html.parser")
    nameOfFirm = soupObjs.find('div', class_="ip-left").find('h2').next_element
    tbody = soupObjs.find('div', {"id":"collapse8"}).find('tbody')
    offers = tbody.find('td').next_sibling.next_sibling.next_element
    seeking = tbody.find('tr').next_sibling.next_sibling.find('td').next_sibling.next_sibling.next_element
    print('Firm name:', nameOfFirm)
    print('Offers:', offers)
    print('Seeking:', seeking)
    print('Hireback Rate:', int(offers) / int(seeking))
Replacing your response call with this code seems to work. The reason is that you weren't passing in the cookie properly.
response = requests.get(
    'https://www.nalpcanada.com/Page.cfm',
    params={'PageID': 33},
    cookies={'DISPLAYNUM': '100000000'}
)
The only other issue I came across was that a ValueError was being raised by this line when certain links (like YLaw Group) don't seem to have "offers" and/or "seeking".
print('Hireback Rate:', int(offers) / int(seeking))
I just commented out the line since you will have to decide what to do in those cases.
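One possible way to handle those cases, instead of commenting the line out, is to treat a missing or non-numeric value as "no rate available". The sketch below keeps the same division as the original line; the function name `hireback_rate` and the choice of returning `None` are assumptions, not the only reasonable policy.

```python
def hireback_rate(offers, seeking):
    """Return int(offers) / int(seeking), or None if it cannot be computed."""
    try:
        return int(offers) / int(seeking)
    except (TypeError, ValueError, ZeroDivisionError):
        # Missing value, non-numeric text, or a zero denominator.
        return None


rate = hireback_rate("3", "4")
print('Hireback Rate:', rate if rate is not None else 'n/a')
```

Firms without the data then print `n/a` rather than stopping the whole crawl.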
I am writing a scraper to get the full list of movies available on hungama.com.
I am requesting the URL "http://www.hungama.com/all/hungama-picks-54/4470/" to get the response.
When you go to this URL, 12 movies are shown on the screen, but as you scroll down the movie count keeps increasing through automatic reloading.
I am parsing the HTML source with the code below:
response.css('div.movie-block-artist.boxshadow.clearfix1>div>div>a::text').extract()
but I only get 12 items even though there are more movies. How can I get all the movies available? Please help.
If you watch the XHR tab in the Network section of the dev tools while scrolling down that page, you can see that it requests paginated URLs such as http://www.hungama.com/all/hungama-picks-54/3632/2/. So, by requesting those pages in a loop as I did below, you can get all the content from that page.
import requests
from scrapy import Selector

page = 1
URL = "http://www.hungama.com/all/hungama-picks-54/3632/"
next_page = "http://www.hungama.com/all/hungama-picks-54/3632/{}/"
while True:
    page += 1
    res = requests.get(URL)
    sel = Selector(text=res.text)
    container = sel.css(".leftbox")
    if len(container) <= 0:
        break  # an empty page means there is nothing left to load
    for item in container:
        title = item.css("#pajax_a::text").extract_first()
        year = item.css(".subttl::text").extract_first()
        print(title, year)
    URL = next_page.format(page)
By the way, the URL you provided above is not working; the one I've supplied is active now. The logic is the same either way.
There seems to be an AJAX request, used as a lazy-load feature, to the URL http://www.hungama.com/all/hungama-picks-54/4470/2/?ajax_call=1&_country=IN which fetches the movies.
In the above URL, change 2 to 3 (http://www.hungama.com/all/hungama-picks-54/4470/3/?ajax_call=1&_country=IN) and so on to get the details of the next movies.
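Generating those URLs can be sketched as follows. The helper names `ajax_page_url` and `ajax_page_urls` are hypothetical, and the path and query parameters are copied from the URLs above; how many pages exist is not known in advance, so the caller would stop when a page comes back empty.

```python
def ajax_page_url(page, list_path="all/hungama-picks-54/4470", country="IN"):
    """Return the lazy-load URL for the given page number."""
    return "http://www.hungama.com/{}/{}/?ajax_call=1&_country={}".format(
        list_path, page, country)


def ajax_page_urls(first_page, last_page):
    """Return the lazy-load URLs for first_page to last_page inclusive."""
    return [ajax_page_url(page) for page in range(first_page, last_page + 1)]


for url in ajax_page_urls(2, 4):
    print(url)
```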