Create get request in a webdriver with selenium - python

Is it possible to send a get request to a webdriver using selenium?
I want to scrape a website with an infinite page and want to scrape a substantial amount of the objects on the website. For this I use Selenium to open the website in a webdriver and scroll down the page until enough objects on the page are visible.
However, I'd like to scrape the information on the page with BeautifulSoup since this is the most effective way in this case. If the GET request is sent in the normal way (see the code), the response only holds the first objects and not the objects from the scrolled-down page (which makes sense).
But is there any way to send a get request to an open webdriver?
import time

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import requests
from bs4 import BeautifulSoup

# Opening the website in the webdriver (url is defined elsewhere)
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get(url)

# Loop for scrolling
scroll_start = 0
for i in range(100):
    scroll_end = scroll_start + 1080
    driver.execute_script(f'window.scrollTo({scroll_start}, {scroll_end})')
    time.sleep(2)
    scroll_start = scroll_end

# The GET request
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')

You should probably find out which endpoint the website is using to fetch the data for the infinite scrolling.
Go to the website, open the Dev Tools, open the Network tab and find the HTTP request that is asking for the content you're seeking; then MAYBE you can use it too. Just know that there are a lot of variables: are they using some sort of authorization for their APIs? Are the APIs returning JSON, XML, HTML, ...? Also, I am not sure if this is fair use.
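For example, here is a minimal sketch of replaying such a request with requests once you have found it in the Network tab; the endpoint URL, the page parameter and the JSON response shape are all hypothetical and need to be replaced with what the site actually sends:
import requests

# Hypothetical endpoint discovered in the Network tab; adjust URL,
# parameters and headers to whatever the site actually uses.
endpoint = 'https://example.com/api/items'
headers = {'User-Agent': 'Mozilla/5.0'}  # some APIs reject requests without one

items = []
for page in range(1, 6):  # fetch the first 5 "scroll" pages
    resp = requests.get(endpoint, params={'page': page}, headers=headers)
    resp.raise_for_status()
    items.extend(resp.json())  # assuming the API returns a JSON list

print(len(items))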

Related

Cannot select HTML element with BeautifulSoup

Novice web scraper here:
I am trying to scrape the name and address from this website https://propertyinfo.knoxcountytn.gov/Datalets/Datalet.aspx?sIndex=1&idx=1. I have attempted the following code, which only returns 'None' (or an empty array if I replace find() with find_all()). I would like it to return the HTML of this particular section so I can extract the text and later add it to a CSV file. If the link doesn't work, or doesn't take you to where I'm working, simply go to the Knox County TN website > property search > select a property.
Much appreciation in advance!
from splinter import Browser
import pandas as pd
from bs4 import BeautifulSoup as soup
import requests
from webdriver_manager.chrome import ChromeDriverManager
# html is the page source grabbed from the splinter browser (e.g. browser.html)
owner_soup = soup(html, 'html.parser')
owner_elem = owner_soup.find('td', class_='DataletData')
owner_elem
OR
# this being the tag and class of the whole section where the info is located
owner_soup = soup(html, 'html.parser')
owner_elem = owner_soup.find_all('div', class_='datalet_div_2')
owner_elem
OR when I try:
browser.find_by_css('td.DataletData')[15]
it returns:
<splinter.driver.webdriver.WebDriverElement at 0x11a763160>
and I can't pull the html contents from that element.
There are a few issues I see, but it could be that you didn't include your code exactly as you actually have it.
Splinter works on its own to get page data by letting you control a browser. You don't need BeautifulSoup or requests if you're using splinter. You use requests if you want the raw response without running any of the things that browsers do for you automatically.
One of these automatic things is redirects. The link you provided does not provide the HTML that you are seeing. This link just has a response header that redirects you to https://propertyinfo.knoxcountytn.gov/, which redirects you again to https://propertyinfo.knoxcountytn.gov/search/commonsearch.aspx?mode=realprop, which redirects again to https://propertyinfo.knoxcountytn.gov/Search/Disclaimer.aspx?FromUrl=../search/commonsearch.aspx?mode=realprop
On this page you have to hit the 'agree' button to get redirected to https://propertyinfo.knoxcountytn.gov/search/commonsearch.aspx?mode=realprop, this time with these cookies set:
Cookie: ASP.NET_SessionId=phom3bvodsgfz2etah1wwwjk; DISCLAIMER=1
I'm assuming the session id is autogenerated, and the Disclaimer value just needs to be '1' for the server to know you agreed to their terms.
So you really have to study the page and understand what's going on to do it on your own using just the requests and BeautifulSoup libraries. Besides the redirects I mentioned, you still have to figure out which network request gives you that session id so you can manually add it to the cookie header you send on all subsequent requests. You can avoid making some requests, so this way is a lot faster, but you do need to be able to follow along in the developer tools' Network tab.
Postman is a good tool to help you set up requests yourself and see their result. Then you can bring all the set up from there into your code.
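A minimal sketch of that approach with requests, assuming (as described above) that accepting the disclaimer only needs the DISCLAIMER=1 cookie and that a requests.Session picks up the ASP.NET_SessionId cookie from the redirect chain; if the 'agree' button also sends a POST, you would have to copy that request from the Network tab as well:
import requests
from bs4 import BeautifulSoup

session = requests.Session()  # keeps ASP.NET_SessionId across requests

# Follow the redirect chain so the server issues the session cookie
session.get('https://propertyinfo.knoxcountytn.gov/search/commonsearch.aspx?mode=realprop')

# Assume clicking 'agree' only matters because it sets this cookie
session.cookies.set('DISCLAIMER', '1', domain='propertyinfo.knoxcountytn.gov')

# Now the search page should be reachable with the same session
resp = session.get('https://propertyinfo.knoxcountytn.gov/search/commonsearch.aspx?mode=realprop')
soup = BeautifulSoup(resp.text, 'html.parser')
print(soup.title)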

Web scraping when scrolling down is needed

I want to scrape, e.g., the title of the first 200 questions under the web page https://www.quora.com/topic/Stack-Overflow-4/all_questions. And I tried the following code:
import requests
from bs4 import BeautifulSoup
url = "https://www.quora.com/topic/Stack-Overflow-4/all_questions"
print("url")
print(url)
r = requests.get(url) # HTTP request
print("r")
print(r)
html_doc = r.text # Extracts the html
print("html_doc")
print(html_doc)
soup = BeautifulSoup(html_doc, 'lxml') # Create a BeautifulSoup object
print("soup")
print(soup)
It gave me the text in https://pastebin.com/9dSPzAyX. If we search for href='/, we can see that the HTML does contain the titles of some questions. However, the problem is that the number is not enough; on the web page, a user needs to manually scroll down to trigger the extra load.
Does anyone know how I could mimic "scrolling down" by the program to load more content of the page?
Infinite scroll on a webpage is based on Javascript functionality. Therefore, to find out what URL we need to access and what parameters to use, we need to either thoroughly study the JS code working inside the page or, preferably, examine the requests that the browser makes when you scroll down the page. We can study the requests using the Developer Tools.
See the example for Quora.
The more you scroll down, the more requests are generated. So your requests should now be made to that URL instead of the normal URL, but keep in mind that you need to send the correct headers and payload.
Another, easier solution is to use Selenium.
I couldn't find a way to do it with requests, but you can use Selenium. I first printed out the number of questions at the initial load, then sent the End key to mimic scrolling down. You can see the number of questions went from 20 to 40 after sending the End key.
I used a 5-second wait before reading the DOM again in case the script runs too fast and the new content hasn't loaded yet. You can improve this by using explicit waits (expected conditions) with Selenium.
The page loads 20 questions per scroll. So if you are looking to scrape 100 questions, then you need to send the End key 5 times.
To use the code below you need to install chromedriver.
http://chromedriver.chromium.org/downloads
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys

CHROMEDRIVER_PATH = ""  # path to your chromedriver binary
CHROME_PATH = ""        # path to the Chrome binary (leave empty to use the default)
WINDOW_SIZE = "1920,1080"

chrome_options = Options()
# chrome_options.add_argument("--headless")
chrome_options.add_argument("--window-size=%s" % WINDOW_SIZE)
chrome_options.binary_location = CHROME_PATH
prefs = {'profile.managed_default_content_settings.images': 2}  # don't load images
chrome_options.add_experimental_option("prefs", prefs)

url = "https://www.quora.com/topic/Stack-Overflow-4/all_questions"

def scrape(url, times):
    if not url.startswith('http'):
        raise Exception('URLs need to start with "http"')
    driver = webdriver.Chrome(
        executable_path=CHROMEDRIVER_PATH,
        chrome_options=chrome_options
    )
    driver.get(url)
    counter = 1
    while counter <= times:
        q_list = driver.find_element_by_class_name('TopicAllQuestionsList')
        questions = q_list.find_elements_by_xpath('//div[@class="pagedlist_item"]')
        print(len(questions))
        # Send the End key to the page to trigger loading of the next batch
        html = driver.find_element_by_tag_name('html')
        html.send_keys(Keys.END)
        time.sleep(5)  # crude wait; an explicit wait (WebDriverWait + EC) would be better
        questions2 = q_list.find_elements_by_xpath('//div[@class="pagedlist_item"]')
        print(len(questions2))
        counter += 1
    driver.close()

if __name__ == '__main__':
    scrape(url, 5)
I recommend using Selenium rather than bs4.
Selenium can control the browser as well as parse the page: scroll down, click buttons, etc.
This example scrolls down to collect all the users who liked a post on Instagram:
https://stackoverflow.com/a/54882356/5611675
If the content only loads on "scrolling down", this probably means that the page is using Javascript to dynamically load the content.
You can try using a headless web client such as PhantomJS to load the page and execute the JavaScript in it, and simulate the scroll by injecting some JS such as document.body.scrollTop = sY; (see "Simulate scroll event using Javascript").
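PhantomJS is no longer maintained, but the same JS-injection idea works with a regular Selenium driver; here is a minimal sketch (the 1000-pixel step and 10 iterations are arbitrary choices, not values from the answer):
import time
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://www.quora.com/topic/Stack-Overflow-4/all_questions")

# Scroll down in steps by injecting JS, giving the page time to load each batch
for step in range(1, 11):
    driver.execute_script("window.scrollTo(0, arguments[0]);", step * 1000)
    time.sleep(2)

soup = BeautifulSoup(driver.page_source, 'lxml')
print(len(soup.find_all('a')))  # the loaded question links are now in the parsed HTML
driver.quit()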

How to fetch a token when it is generated by JS for sending requests?

I am creating a script for this site:
The first section (making the account is done):
https://my.shaadi.com/profile-creation/step/1?gtrk=1
However, when configuring profiles I am having an issue: the page is loaded by JS and the token is generated using JS as well.
This is the JS file:
https://my.shaadi.com/static/js/main.4c82cc30.js
X-Access-Token: 2a719ecb4cf7a3ef45676834a596bc58|4SH80109362|
X-App-Key: 69c3f1c1ea31d60aa5516a439bb65949cf3f8a1330679fa7ff91fc9a5681b564
These are the 2 tokens I am looking to get.
I can't figure out a way of getting these. Is it possible to use requests to do this, or would it require a headless browser to run the JS? (I want to do it in pure Python requests.)
The best/easiest option is to use selenium or dryscrape together with BeautifulSoup.
#from bs4 import BeautifulSoup
from selenium import webdriver
client = webdriver.PhantomJS()
#client.get('https://my.shaadi.com/profile-creation/step/1?gtrk=1')
client.get('https://my.shaadi.com/static/js/main.4c82cc30.js')
body = client.page_source
Now you can parse body with regexp or BeautifulSoup
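For instance, a sketch of pulling a token out of the fetched source with a regular expression; the pattern is hypothetical and assumes the token appears as a quoted 64-character hex string (like the X-App-Key above), so you would need to adapt it to what the file actually contains:
import re

# Hypothetical pattern: a 64-character hex string in quotes
match = re.search(r'["\']([0-9a-f]{64})["\']', body)
if match:
    app_key = match.group(1)
    print(app_key)
else:
    print("Token not found - inspect the JS file and refine the pattern")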

Looping through web pages to webscrape data

I'm trying to loop through Zillow pages and extract data. I know that the URL is updated with a new page number after each iteration, but the data extracted is as if the URL were still on page 1.
import selenium
from selenium import webdriver
import requests
from bs4 import BeautifulSoup
import pandas as pd

next_page = 'https://www.zillow.com/romeo-mi-48065/real-estate-agent-reviews/'
num_data1 = pd.DataFrame(columns=['name', 'number'])
browser = webdriver.Chrome()
browser.get('https://www.zillow.com/romeo-mi-48065/real-estate-agent-reviews/')

while True:
    page = requests.get(next_page)
    contents = page.content
    soup = BeautifulSoup(contents, 'html.parser')
    number_p = soup.find_all('p', attrs={'class': 'ldb-phone-number'}, text=True)
    name_p = soup.find_all('p', attrs={'class': 'ldb-contact-name'}, text=True)
    number_p = pd.DataFrame(number_p, columns=['number'])
    name_p = pd.DataFrame(name_p, columns=['name'])
    num_data = number_p['number'].apply(lambda x: x.text.strip())
    nam_data = name_p['name'].apply(lambda x: x.text.strip())
    number_df = pd.DataFrame(num_data, columns=['number'])
    name_df = pd.DataFrame(nam_data, columns=['name'])
    num_data0 = pd.concat([number_df, name_df], axis=1)
    num_data1 = num_data1.append(num_data0)
    try:
        button = browser.find_element_by_css_selector('.zsg-pagination>li.zsg-pagination-next>a').click()
        next_page = str(browser.current_url)
    except IndexError:
        break
Replace page=requests.get(next_page) with the page source from the browser itself, i.e. feed browser.page_source straight into BeautifulSoup (page_source is already a string, so you no longer need the .content step).
Basically what's happening is that you're going to the next page in Chrome, but then trying to load that page's URL with requests, which gets redirected back to page one by Zillow (probably because the request doesn't have the cookies or appropriate request headers).
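A sketch of the corrected loop body under that assumption, reusing the imports and the browser object from the question (only the scraping and pagination parts are shown; the DataFrame handling stays the same):
while True:
    # Parse what the browser is actually showing instead of re-requesting the URL
    soup = BeautifulSoup(browser.page_source, 'html.parser')
    number_p = soup.find_all('p', attrs={'class': 'ldb-phone-number'}, text=True)
    name_p = soup.find_all('p', attrs={'class': 'ldb-contact-name'}, text=True)
    # ... build the DataFrames as before ...
    try:
        browser.find_element_by_css_selector('.zsg-pagination>li.zsg-pagination-next>a').click()
    except Exception:  # no "next" button left on the last page
        break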
Why not make your life easier and use the Zillow API instead of scraping? (Do you even have permission to scrape their site?)

Render HTTP response (HTML content) in a Selenium webdriver (browser)

I am using the Requests module to send GET and POST requests to websites and then processing their responses. If the Response.text meets a certain criterion, I want it to be opened up in a browser. To do so, I am currently using the selenium package and resending the request to the webpage via the Selenium webdriver. However, I feel this is inefficient as I have already obtained the response once, so is there a way to render this obtained Response object directly in the browser opened via Selenium?
EDIT
A hacky way that I could think of is to write the response.text to a temporary file and open that in the browser. Please let me know if there is a better way to do it than this.
To directly render some HTML with Selenium, you can use the data scheme with the get method:
from selenium import webdriver
import requests

# Use .text (a str) rather than .content (bytes) so it can be concatenated to the URL;
# pages containing characters like '#' may also need URL-encoding first.
content = requests.get("http://stackoverflow.com/").text
driver = webdriver.Chrome()
driver.get("data:text/html;charset=utf-8," + content)
Or you could write the page with a piece of script:
from selenium import webdriver
import requests

# .text is used so the argument passed to execute_script is a plain string
content = requests.get("http://stackoverflow.com/").text
driver = webdriver.Chrome()
driver.execute_script("""
document.location = 'about:blank';
document.open();
document.write(arguments[0]);
document.close();
""", content)
