Web scraping from dynamic websites in Python and Selenium

I asked a question yesterday and got an answer from QHarr explaining that dynamic websites like Workday (take https://wd1.myworkdaysite.com/recruiting/upenn/careers-at-penn for example) generate job posts' links by making extra XHR requests. So if I want to extract specific job post links, normal webpage scraping with an HTML parser or CSS selectors keyed on keywords is not feasible, and the links cannot be extracted from the HTML source code returned by the Selenium driver either. (Based on WeiZhang2017's GitHub post: https://gist.github.com/Weizhang2017/0029b2ff59e943ca9f024c117fbdf88a)
Since websites like Workday use Ajax to load data as needed, I used Selenium to simulate scrolling down the page and load more data. However, as for getting the JSON response through Selenium, I searched a lot but couldn't find an answer that fits my need.
My plan to extract specific job posts' links had 3 steps in general:
1. Use Selenium to load and scroll down the website.
2. Use something like requests' .get().json() within Selenium to get the scrolled-down website's JSON response data.
3. Search through the JSON response data with my specific keywords to get the specific posts' links.
However, here are my questions.
Step 1: I did this with a loop that scrolls down as many pages as I want. No problem.
scroll = 3
while scroll:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(3)  # wait for the Ajax-loaded results to render
    scroll = scroll - 1
Step 2: I don't know what kind of method can work here; I searched a lot and couldn't find an easy-to-understand answer. (I am new to Python and Selenium, with a limited understanding of scraping dynamic websites.)
Step 3: I think I can handle the search and get what I want (the specific job posts' links) once I have the JSON data (assume it is named log) as shown in Chrome under Inspect > Network > Preview.
links = ['https://wd1.myworkdaysite.com' + x['title']['commonlink'] for x in log['body']['children'][0]['children'][0]['listItems'] if x['instance'][0]['text'] == mySpecificWords]
Appreciate any thoughts on a Step 2 solution.
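One possible route for Step 2, as a sketch rather than a tested solution: plain Selenium does not expose XHR responses, but the third-party selenium-wire package wraps the driver and records the browser's network traffic, so the JSON bodies can be read after scrolling. The filter below (a "myworkdaysite" substring plus a JSON content type) is an assumption; match it against the real XHR URL you see in the Network tab.
import json
import time
from seleniumwire import webdriver  # pip install selenium-wire
from seleniumwire.utils import decode

driver = webdriver.Chrome()
driver.get("https://wd1.myworkdaysite.com/recruiting/upenn/careers-at-penn")
driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
time.sleep(3)

log = None
for request in driver.requests:
    ctype = request.response.headers.get("Content-Type", "") if request.response else ""
    if "myworkdaysite" in request.url and "json" in ctype:  # assumed filter
        body = decode(request.response.body,
                      request.response.headers.get("Content-Encoding", "identity"))
        log = json.loads(body)
driver.quit()
Once log holds the parsed JSON, the Step 3 list comprehension above should work on it unchanged.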

Related

How to loop through each page of website for web scraping with BeautifulSoup

I am scraping job posting data from a website using BeautifulSoup. I have working code that does what I need, but it only scrapes the first page of job postings. I am having trouble figuring out how to iteratively update the url to scrape each page. I am new to Python and have looked at a few different solutions to similar questions, but have not figured out how to apply them to my particular url. I think I need to iteratively update the url or somehow click the next button and then loop my existing code through each page. I appreciate any solutions.
url: https://jobs.utcaerospacesystems.com/search-jobs
First, BeautifulSoup doesn't have anything to do with GETing web pages - you get the webpage yourself, then feed it to bs4 for processing.
The problem with the page you linked is that it's javascript - it only renders correctly in a browser (or any other javascript VM).
Fabricator is on the right track - you'll need to watch the developer console and see what AJAX requests the JS is sending to the server. In this case, also take a look at the query string params, which include a param called CurrentPage - that's probably the one you want to focus on.
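As a minimal sketch of that idea, assuming the endpoint accepts CurrentPage as an ordinary query-string parameter and returns parseable HTML (inspect the actual AJAX response to confirm both):
import requests
from bs4 import BeautifulSoup

base_url = "https://jobs.utcaerospacesystems.com/search-jobs"
for page in range(1, 6):  # first five pages, as an example
    resp = requests.get(base_url, params={"CurrentPage": page})
    soup = BeautifulSoup(resp.text, "html.parser")
    # feed `soup` into the existing first-page scraping code here
    print(page, resp.status_code)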

Python - Beautiful Soup to grab emails from website

I've been trying to figure out a simple way to run through a set of URLs that lead to pages that all have the same layout. We figured out that one issue is that in the original list the URLs are http but then they redirect to https. I am not sure if that then causes a problem in trying to pull the information from the page. I can see the structure of the page when I use Inspector in Chrome, but when I try to set up the code to grab relevant links I come up empty (literally). The most general code I have been using is:
soup = BeautifulSoup(urllib2.urlopen('https://ngcproject.org/program/algirls').read())
links = SoupStrainer('a')
print links
which yields:
a|{}
Given that I'm new to this I've been trying to work with anything that I think might work. I also tried:
mail = soup.find(attrs={'class':'tc-connect-details_send-email'}).a['href']
and
spans = soup.find_all('span', {'class' : 'tc-connect-details_send-email'})
lines = [span.get_text() for span in spans]
print lines
but these don't yield anything either.
I am assuming that it's an issue with my code and not one that the data are hidden from being scraped. Ideally I want to have the data passed to a CSV file for each URL I scrape but right now I need to be able to confirm that the code is actually grabbing the right information. Any suggestions welcome!
If you press CTRL+U in Google Chrome, or right-click > View Source, you'll see that the page is rendered using JavaScript.
urllib is not going to be able to display/download what you're looking for.
You'll have to use an automated browser (Selenium is the most popular), which you can drive with Google Chrome / Firefox or a headless browser (PhantomJS).
You can then get the rendered page from Selenium, store it, and manipulate it any way you see fit.
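As a rough sketch of that workflow (Python 3 here; the CSS class is copied from the question, and the fixed sleep is a crude stand-in for a proper explicit wait):
import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://ngcproject.org/program/algirls")
time.sleep(5)  # crude wait for the JavaScript to render
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

spans = soup.find_all("span", {"class": "tc-connect-details_send-email"})
for span in spans:
    print(span.get_text())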

Scraping the content of a box with infinite scrolling in Python

I am new to Python and web crawling. I intend to scrape links from the top stories of a website. I was told to look at its Ajax requests and send similar ones. The problem is that all requests for the links are the same: http://www.marketwatch.com/newsviewer/mktwheadlines
My question is how to extract links from an infinite scrolling box like this. I am using Beautiful Soup, but I think it's not suitable for this task. I am also not familiar with Selenium and JavaScript. I do know how to scrape certain requests with Scrapy, though.
It is indeed an AJAX request. If you take a look at the network tab in your browser inspector, you can see that it's making a POST request to download the URLs of the articles.
Every value is self-explanatory here except maybe docid and timestamp. docid seems to indicate which box to pull articles for (there are multiple boxes on the page), and it seems to be the id attached to the <li> element under which the article URLs are stored.
Fortunately, in this case POST and GET are interchangeable, and the timestamp parameter doesn't seem to be required. So you can actually view the results in your browser by right-clicking the URL in the inspector and selecting "copy location with parameters":
http://www.marketwatch.com/newsviewer/mktwheadlines?blogs=true&commentary=true&docId=1275261016&premium=true&pullCount=100&pulse=true&rtheadlines=true&topic=All%20Topics&topstories=true&video=true
This example has the timestamp parameter removed and pullCount increased to 100, so simply request it and it will return 100 article URLs.
You can mess around further to reverse engineer how the website does it and what every keyword is for, but this is a good start.
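A sketch of issuing that reconstructed request directly; treating the response as HTML and pulling every anchor out of it is an assumption, so adjust the parsing once you've looked at the real payload:
import requests
from bs4 import BeautifulSoup

url = ("http://www.marketwatch.com/newsviewer/mktwheadlines"
       "?blogs=true&commentary=true&docId=1275261016&premium=true"
       "&pullCount=100&pulse=true&rtheadlines=true"
       "&topic=All%20Topics&topstories=true&video=true")
resp = requests.get(url)
soup = BeautifulSoup(resp.text, "html.parser")
links = [a["href"] for a in soup.find_all("a", href=True)]
print(links[:10])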

Webscraping Financial Data from Morningstar

I am trying to scrape data from the morningstar website below:
http://financials.morningstar.com/ratios/r.html?t=IBM&region=USA&culture=en_US
I am currently trying it with just IBM, but I hope to eventually be able to type in the ticker of another company and do the same with that one. My code so far is below:
import requests, os, bs4, string
url = 'http://financials.morningstar.com/ratios/r.html?t=IBM&region=USA&culture=en_US'
fin_tbl = ()
page = requests.get(url)
c = page.content
soup = bs4.BeautifulSoup(c, "html.parser")
summary = soup.find("div", {"class":"r_bodywrap"})
tables = summary.find_all('table')
print(tables[0])
The problem I am experiencing is that, unlike with a simpler webpage I have scraped, the program can't seem to locate any tables, even though I can see them in the HTML for the page.
In researching this problem the closest stackoverflow question is below:
Python webscraping - NoneObeject Failure - broken HTML?
In that one they explained that Morningstar's tables are dynamically loaded; they used some JSON code I am unfamiliar with and somehow generated a different weblink which managed to scrape the data, but I don't understand where it came from.
It's a real problem scraping some modern web pages, particularly on pages generated by single-page applications (where the content is maintained by AJAX calls and DOM modification rather than delivered as ready-to-go HTML in a single server response).
The best way I have found to access such content is to use the Selenium web testing environment to have a browser load the page under the control of my program, then extract the page contents from Selenium for scraping. There are other environments that will execute the scripts and modify the DOM appropriately, but I haven't used any of them.
It's not as difficult as it sounds, but it will take you a little jiggering around to get there.
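A minimal sketch of that approach for the Morningstar page, reusing the selectors from the question (the fixed sleep is a crude substitute for an explicit wait):
import time
import bs4
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://financials.morningstar.com/ratios/r.html"
           "?t=IBM&region=USA&culture=en_US")
time.sleep(5)  # give the AJAX calls time to populate the tables
soup = bs4.BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

summary = soup.find("div", {"class": "r_bodywrap"})
tables = summary.find_all("table")
print(len(tables))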
Web scraping can be greatly simplified when the site offers an API, be it officially supported or just an unofficial hack. Even the hack is better than trying to fiddle with the HTML which can change every day.
So a search for morningstar api might be fruitful. And, in fact, some friendly Gister has already worked this out for you.
Should that search come up empty, a usually fruitful approach is to investigate what AJAX calls the page makes to retrieve data and then issue them directly. This can be done via the browser debugger, in the "Network" tab or similar, where each request can be inspected in detail in a very friendly UI.
I've found scraping dynamic sites to be a lot easier with JavaScript than with Python + Selenium. There is a great module for nodejs/phantomjs: ScraperJS. It is very easy to use: it injects jQuery into the scraped page and you can extract data with jQuery selectors.

Scrape all the pages of a website when next page's follow-up link is not available in the current page source code

Hi, I have successfully scraped all the pages of a few shopping websites using Python and regular expressions.
But now I am stuck scraping all the pages of a particular website where the next page's follow-up link is not present in the current page, like this one: http://www.jabong.com/men/clothing/mens-jeans/
This website loads the next pages' data into the same page dynamically via Ajax calls, so while scraping I am only able to get the data of the first page. But I need to scrape all the items present on all the pages of that website.
I am not finding a way to get the source code of all the pages of this type of website, where the next page's follow-up link is not available in the current page. Please help me with this.
Looks like the site is using AJAX requests to get more search results as the user scrolls down. The initial set of search results can be found in the main request:
http://www.jabong.com/men/clothing/mens-jeans/
As the user scrolls down, the page detects when they reach the end of the current set of results and loads the next set as needed:
http://www.jabong.com/men/clothing/mens-jeans/?page=2
One approach would be to simply keep requesting subsequent pages till you find a page with no results.
By the way, I was able to determine this by using the proxy tool in screen-scraper. You could also use a tool like Charles or HttpFox. The key is to browse the site and watch what HTTP requests get made so that you can mimic them in your code.
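A sketch of that "keep paging until empty" loop; the product selector is hypothetical and the stop condition is illustrative, so adapt both after inspecting a real response:
import requests
from bs4 import BeautifulSoup

base = "http://www.jabong.com/men/clothing/mens-jeans/"
page = 1
while True:
    resp = requests.get(base, params={"page": page})
    soup = BeautifulSoup(resp.text, "html.parser")
    items = soup.select("div.product")  # hypothetical item selector
    if resp.status_code != 200 or not items:
        break  # no results left: we've gone past the last page
    print("page %d: %d items" % (page, len(items)))
    page += 1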
