I am running a Python script to scrape data from the first 5 pages of search results on the SSRN database. What I currently have working is:
links = browser.find_elements_by_xpath("//h3//a")
for link in links:
    href = link.get_attribute("href")
    extract_information(href, sheet, excel_data_pointer, workbook)
This is all currently being done on page 1 of the site. I also have some additional search preferences specified in my code, namely sorting by "Date Posted, Descending", which is the last option on the search dropdown. I access this on the first page by using:
select_element = Select(browser.find_element_by_css_selector('select#sort-by'))
#Modify me to change the Sorting filter
select_element.select_by_visible_text("Date Posted, Descending")
browser.implicitly_wait(2)
Then I run the code I stated at the start, extract the links, and use a helper function to perform my scraping. My problem begins on the second page: when it loads, the browser pauses for a second and then switches back to the default search preferences.
I am using a for loop to traverse the pages in the script, accessing them with:
for i in range(2, num_papers):
    browser.find_element_by_xpath("//*[@id='maincontent']/div/div[1]/div/div[3]/div/ul/li[" + str(i) + "]/a").click()
My first question is: why does this happen? Does the browser not automatically remember search preferences? I have tried accessing the dropdown again on this page, but it doesn't seem to change. Is there a way to have Selenium browsers memorise these preferences?
EDIT: I was also searching by "Title, Abstract and Keywords" in my original search which worked well on page 1 but by page 2 it started to search through "Title and Abstract" only (the default option)
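One workaround worth trying (a minimal sketch, not verified against SSRN) is to re-apply the preferences after every pagination click and wait for the results list to refresh before scraping; reapply_preferences below is a hypothetical helper, and the "Title, Abstract and Keywords" dropdown would need the same treatment once its selector is known. If re-applying the sort resets the pagination, an alternative is to encode the preferences directly in the search URL instead.

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import Select, WebDriverWait

def reapply_preferences(browser):
    # Re-select the sort order, since the site seems to fall back to its
    # defaults on navigation; the search-scope dropdown would go here too.
    Select(browser.find_element_by_css_selector('select#sort-by')) \
        .select_by_visible_text("Date Posted, Descending")
    # Wait for the result links to be present again before scraping.
    WebDriverWait(browser, 10).until(
        EC.presence_of_all_elements_located((By.XPATH, "//h3//a")))

for i in range(2, num_papers):
    browser.find_element_by_xpath(
        "//*[@id='maincontent']/div/div[1]/div/div[3]/div/ul/li[" + str(i) + "]/a").click()
    reapply_preferences(browser)
    for link in browser.find_elements_by_xpath("//h3//a"):
        extract_information(link.get_attribute("href"), sheet, excel_data_pointer, workbook)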
I asked a question yesterday and got an answer from @QHarr explaining that dynamic websites like Workday (take https://wd1.myworkdaysite.com/recruiting/upenn/careers-at-penn for example) generate job post links by making extra XHR requests. So, if I want to extract specific job post links, normal webpage scraping with an HTML parser or CSS selectors by keyword is not feasible, since the links cannot be extracted from the HTML source code generated by the Selenium driver. (Based on WeiZhang2017's GitHub post: https://gist.github.com/Weizhang2017/0029b2ff59e943ca9f024c117fbdf88a)
In my case, with websites like Workday that use Ajax to load data as needed, I used Selenium to simulate scrolling down the page and load more data. However, as for getting the JSON response through Selenium, I searched a lot but couldn't find an answer that fits my need.
My thought was to extract the specific job post links in 3 general steps:
Use Selenium to load and scroll down the website
Use a method similar to requests.get().json() in Selenium to get the scrolled-down website's JSON response data
Search through the JSON response data with my specific keywords to get the specific posts' links.
However, here come my questions.
Step 1: I did this with a loop to scroll down the pages I want. No problem.
scroll = 3
while scroll:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(3)
    scroll -= 1
Step 2: I don't know what kind of method can work here; after a lot of searching I couldn't find an easy-to-understand answer. (I am new to Python and Selenium, with a limited understanding of scraping dynamic websites.)
Step 3: I think I could handle the search and get what I want (the specific job post links) once I have the JSON data (assume it is named log), as shown in Chrome under Inspect > Network > Preview.
links = ['https://wd1.myworkdaysite.com' + x['title']['commonlink']
         for x in log['body']['children'][0]['children'][0]['listItems']
         if x['instance'][0]['text'] == mySpecificWords]
Appreciate any thoughts on a solution for step 2.
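For reference, one direction that might work for step 2 (a sketch, not a verified solution) is the third-party selenium-wire package, which records the browser's network traffic so the JSON responses behind the Ajax calls can be read from Python; the Content-Type filter and the decode helper below are assumptions based on its documented API.

import json
import time
from seleniumwire import webdriver          # pip install selenium-wire
from seleniumwire.utils import decode       # available in recent versions

driver = webdriver.Chrome()
driver.get('https://wd1.myworkdaysite.com/recruiting/upenn/careers-at-penn')

# Scroll as in step 1 so the site fires its XHR requests for more results.
for _ in range(3):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(3)

# Keep every JSON response body that was captured, for use in step 3.
json_responses = []
for request in driver.requests:
    if request.response and 'application/json' in (request.response.headers.get('Content-Type') or ''):
        body = decode(request.response.body,
                      request.response.headers.get('Content-Encoding', 'identity'))
        json_responses.append(json.loads(body))

driver.quit()

Each entry in json_responses can then be searched with the step 3 list comprehension to pick out the matching job post links.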
I have dabbled with bits of simple code over the years. I am now interested in automating some repetitive steps in a web-based CRM used at work. I tried a few automation tools: I was not able to get AutoIt to work with the Chrome webdriver, and I then tried WinTask without making meaningful progress. I started exploring Python and Selenium last week.
I have now automated the first few steps of my project by Googling each step I wanted to achieve and learning from pages on Stack Overflow and other sites. Where I need help is that most of the links in the CRM are some sort of JavaScript links. Most of the text links or images have links that are formatted like this...
javascript:window.location = 'Reports/ResponseTimes.aspx?from=1%2f14%2f2021&to=1%2f14%2f2021&target=gn';
It looks like the many find_element_by functions in Selenium do not interact with the JavaScript links. Tonight I found a page that directed me to use driver.execute_script(javaScript), and eventually I found an example that made it clear I should pass the JavaScript link into that function. This works...
driver.execute_script("window.location = 'Reports/ResponseTimes.aspx?from=1%2f14%2f2021&to=1%2f14%2f2021&target=gn';")
My issue is that I now see the JavaScript links are actually generated dynamically: in the code above, the link gets updated with dates based on the current date. I can't reuse the driver.execute_script() call above since the dates have to be updated.
My hope is to find a way to locate the JavaScript links I need based on some part of the link that does not change. The link above always has "target=gn" at the end, and that is unique enough that if I could find the current version of the link, pull it into a variable, and then run it in driver.execute_script(), I believe that would solve my current issue.
I expect a solution could then be used for the next step I need to perform, where there is a list of new leads that all need to be updated in a manner that tells the system a human has reviewed the lead and "stopped the clock". To view each lead, there are more JavaScript links. Each link is unique since it includes a value that is the record number for the lead. Here are the first two...
javascript:top.viewItem(971244899);
javascript:top.viewItem(971312602);
I imagine the approach that is needed is to search the page for some or all of... javascript:top.viewItem( ...in order to store a link like... javascript:top.viewItem(971244899); ...in a variable so that it can be passed to... driver.execute_script().
Thanks for any suggestions. I have made many searches on this site and Google for phrases that might teach me more about working with javascript links. I am asking for guidance since I have not been able to move forward on my own. Here's my current code...
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time

# Raw string so the backslashes in the Windows path are not treated as escapes
PATH = r"C:\Program Files (x86)\chromedriver.exe"

driver = webdriver.Chrome(PATH)
driver.get("https://apps.vinmanager.com/cardashboard/login.aspx")
# log in
time.sleep(1)
search = driver.find_element_by_name("username")
search.send_keys("xxx")
search.send_keys(Keys.RETURN)
time.sleep(2)
search = driver.find_element_by_name("password")
search.send_keys("xxx")
search.send_keys(Keys.RETURN)
time.sleep(1)
# close news pop-up
driver.find_element_by_link_text("Close").click()
time.sleep(2)
# Nav to left pane
driver.switch_to.frame('leftpaneframe')
# Leads at No Contact link
driver.execute_script("window.location = 'Reports/ResponseTimes.aspx?from=1%2f14%2f2021&to=1%2f14%2f2021&target=gn';")
Eventually I found enough info online to recognize that I needed to replace the "//a" tag in the XPath find method with the proper tag, which was "//area" in my case, and then extract the href so that I could execute it...
## click no contact link ##
print('click no contact link...')
cncl = driver.find_element_by_xpath("//area[contains(@href,'target=gn')]").get_attribute('href')
time.sleep(2)
driver.execute_script(cncl)
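In principle the same href-extraction pattern could be extended to the lead links mentioned earlier; here is a rough sketch (untested, and it assumes the viewItem links are ordinary anchor elements reachable in the current frame):

## collect the javascript:top.viewItem(...) lead links ##
print('collecting lead links...')
lead_hrefs = [a.get_attribute('href')
              for a in driver.find_elements_by_xpath("//a[contains(@href, 'viewItem')]")]
for js_link in lead_hrefs:
    # js_link looks like: javascript:top.viewItem(971244899);
    driver.execute_script(js_link)
    time.sleep(2)
    # ... mark the lead as reviewed here, then navigate back before the next one ...

Collecting the href strings up front avoids stale-element errors once the first execute_script() call navigates away from the list.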
Trying to scrape a website which contains some dynamic data. These data could not be captured with Python Selenium's driver.page_source, even after adding lots of waits. But when I inspected with Firebug, I came to know that these values are computed within the browser by referenced JavaScript files.
I also inspected the page source taken from Selenium thoroughly for these values and found no trace of them; all I can see is the element's id.
But in the actual browser, these values are present.
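One thing worth ruling out (a minimal sketch; the URL and element id below are placeholders, not taken from the actual page) is replacing the fixed waits with an explicit wait that only returns once the element's text has actually been filled in. If the value still never appears, it may be rendered inside an iframe, which would need driver.switch_to.frame(...) before reading it.

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get('https://example.com/page-with-dynamic-values')    # placeholder URL

# Wait up to 20 seconds for the element, then wait until its text is non-empty,
# i.e. until the referenced JavaScript has actually written the value into it.
element = WebDriverWait(driver, 20).until(
    lambda d: d.find_element_by_id('dynamic-value'))          # placeholder id
WebDriverWait(driver, 20).until(lambda d: element.text.strip() != '')

print(element.text)            # the value as rendered in the browser
print(driver.page_source)      # it should now appear in the source as well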
I've been trying to figure out a simple way to run through a set of URLs that lead to pages which all have the same layout. One issue we figured out is that in the original list the URLs are http, but they then redirect to https. I am not sure whether that causes a problem in trying to pull the information from the page. I can see the structure of the page when I use the Inspector in Chrome, but when I try to set up the code to grab the relevant links I come up empty (literally). The most general code I have been using is:
soup = BeautifulSoup(urllib2.urlopen('https://ngcproject.org/program/algirls').read())
links = SoupStrainer('a')
print links
which yields:
a|{}
Given that I'm new to this, I've been trying anything I think might work. I also tried:
mail = soup.find(attrs={'class':'tc-connect-details_send-email'}).a['href']
and
spans = soup.find_all('span', {'class' : 'tc-connect-details_send-email'})
lines = [span.get_text() for span in spans]
print lines
but these don't yield anything either.
I am assuming that it's an issue with my code and not that the data are hidden from scraping. Ideally I want the data written to a CSV file for each URL I scrape, but right now I need to confirm that the code is actually grabbing the right information. Any suggestions welcome!
If you press CTRL+U in Google Chrome, or right-click > View Source,
you'll see that the page is rendered using JavaScript.
urllib is not going to be able to display/download what you're looking for.
You'll have to use an automated browser (Selenium is the most popular), either with Google Chrome / Firefox or a headless browser (PhantomJS).
You can then get the information from Selenium, store it, and manipulate it any way you see fit.
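For example, a rough sketch of that idea (the class name comes from the question's snippets; the rest is an assumption about the rendered page, not verified against it):

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://ngcproject.org/program/algirls')

# Hand the fully rendered DOM to BeautifulSoup instead of the raw urllib response.
soup = BeautifulSoup(driver.page_source, 'html.parser')

# Class name taken from the question; adjust if the rendered markup differs.
block = soup.find(attrs={'class': 'tc-connect-details_send-email'})
if block is not None and block.a is not None:
    print(block.a.get('href'))

driver.quit()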
Hi, I have successfully scraped all the pages of a few shopping websites using Python and regular expressions.
But now I am having trouble scraping all the pages of a particular website where the next-page link is not present on the current page, like this one: http://www.jabong.com/men/clothing/mens-jeans/
This website loads the next pages' data into the same page dynamically via Ajax calls. So while scraping I am only able to scrape the data of the first page, but I need to scrape all the items present on all pages of that website.
I have not found a way to get the source code of all the pages of this type of website, where the next page's link is not available on the current page. Please help me with this.
Looks like the site is using AJAX requests to get more search results as the user scrolls down. The initial set of search results can be found in the main request:
http://www.jabong.com/men/clothing/mens-jeans/
As the user scrolls down, the page detects when they reach the end of the current set of results and loads the next set as needed:
http://www.jabong.com/men/clothing/mens-jeans/?page=2
One approach would be to simply keep requesting subsequent pages till you find a page with no results.
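A rough sketch of that loop (the ?page= parameter comes from the URLs above; the check for an empty page is a placeholder to adapt to the actual markup):

import requests

base_url = 'http://www.jabong.com/men/clothing/mens-jeans/'
page = 1
while True:
    resp = requests.get(base_url, params={'page': page})
    # Stop on an error response or a page that no longer contains product markup;
    # 'product' is a placeholder marker, use whatever identifies items on the page.
    if resp.status_code != 200 or 'product' not in resp.text:
        break
    # ... parse resp.text with your regular expressions here ...
    page += 1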
By the way, I was able to determine this by using the proxy tool in screen-scraper. You could also use a tool like Charles or HttpFox. The key is to browse the site and watch what HTTP requests get made so that you can mimic them in your code.