Is it possible to scrape a "dynamic webpage" with BeautifulSoup? - python

I am just beginning to use BeautifulSoup to scrape websites. I think I have the basics, although I lack theoretical knowledge about webpages, so I will do my best to formulate my question.
What I mean by a dynamic webpage is the following: a site whose HTML changes based on user action. In my case, it's collapsible tables.
I want to obtain the data inside a "div" tag, but when the page loads, the data seems to be unavailable in the HTML. When you click on the table it expands, and the "class" of this "div" changes from something like "something blabla collapsible" to "something blabla collapsible active". That expanded state I can scrape with what I know.
Can I get this data using BeautifulSoup? If I can't, I thought of using something like Selenium to click on all the tables and then download the HTML, which I could then scrape. Is there an easier way?
Thank you very much.

It depends. If the data is already present when the page loads, then it is available to scrape; it is just in a different element, or hidden. If the click event triggers loading of the data in some way, then no: you will need Selenium or another browser-automation tool (for example, a headless browser) to automate this.
BeautifulSoup is only an HTML parser, so whatever data you get by requesting the page is the only data that BeautifulSoup can access.
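To check the first case described above (data present in the source but hidden), the following is a minimal sketch. It uses only the standard library's html.parser so it runs without extra installs; with BeautifulSoup the equivalent lookup would be soup.select('div.collapsible'). The sample markup, the class name check, and the helper names are assumptions for illustration, not the asker's real page.

```python
from html.parser import HTMLParser

# Hypothetical markup: the collapsed table is hidden by CSS but still in the source.
SAMPLE_HTML = """
<div class="something blabla collapsible">
  <table><tr><td>hidden value</td></tr></table>
</div>
<div class="other">ignore me</div>
"""


class CollapsibleText(HTMLParser):
    """Collect text inside any <div> whose class list contains 'collapsible'."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # <div> nesting depth inside a matching div (0 = outside)
        self.chunks = []  # text fragments found inside matching divs

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        # Enter a matching div, or go one level deeper if already inside one.
        if self.depth or "collapsible" in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def collapsible_text(html):
    """Return the text fragments found inside collapsible divs."""
    parser = CollapsibleText()
    parser.feed(html)
    return parser.chunks


print(collapsible_text(SAMPLE_HTML))
```

If this kind of check returns nothing for the HTML you actually downloaded, the data is most likely loaded on click, which is the second case in the answer.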

Related

Web scraping from dynamic websites in Python and Selenium

I asked a question yesterday and got an answer from @QHarr that dynamic websites like Workday (take https://wd1.myworkdaysite.com/recruiting/upenn/careers-at-penn for example) generate job posts' links by making extra XHR requests. So, if I want to extract specific job post links, normal webpage scraping by parsing the HTML or using CSS selectors with keywords is not feasible, because the links cannot be extracted from the HTML source code generated by the Selenium driver. (Based on WeiZhang2017's GitHub post: https://gist.github.com/Weizhang2017/0029b2ff59e943ca9f024c117fbdf88a)
In my case, since websites like Workday use Ajax to load data when needed, I used Selenium to simulate scrolling down the page to get more data. However, as for getting the JSON response using Selenium, I searched a lot but couldn't find an answer that fits my need.
My plan for extracting specific job posts' links has 3 general steps:
1. Use Selenium to load the website and scroll down.
2. Use something in Selenium similar to requests' .get().json() to get the JSON response data of the scrolled-down website.
3. Search through the JSON response data with my specific keywords to get the specific posts' links.
However, here are my questions.
Step 1: I did this with a loop that scrolls down as many pages as I want. No problem.
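import time

scroll = 3
while scroll:
    # Scroll to the bottom so the page requests the next batch of posts.
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(3)  # give the XHR time to finish
    scroll -= 1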
Step 2: After a lot of searching I still don't know what method would work, and I couldn't find an easy-to-understand answer. (I am new to Python and Selenium and have a limited understanding of dynamic website scraping.)
Step 3: I think I could handle the search and get what I want (specific job posts' links) once I have the JSON data (assume it is named log) as shown in Chrome's Inspect > Network > Preview.
links = ['https://wd1.myworkdaysite.com' + x['title']['commonlink']
         for x in log['body']['children'][0]['children'][0]['listItems']
         if x['instance'][0]['text'] == mySpecificWords]
I'd appreciate any thoughts on solutions for step 2.
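One possible approach to step 2 is to skip Selenium for that step and request the XHR URL seen in the browser's Network tab directly. The sketch below assumes the server returns JSON when the request carries an "Accept: application/json" header, and that the payload has the shape used in the list comprehension above; neither is confirmed here. The names fetch_json and matching_links are hypothetical helpers.

```python
import json
from urllib.request import Request, urlopen

BASE = "https://wd1.myworkdaysite.com"


def fetch_json(url):
    """Request a URL asking for JSON and parse the body.

    Assumption: the endpoint honours the Accept header and returns JSON.
    """
    req = Request(url, headers={"Accept": "application/json",
                                "User-Agent": "Mozilla/5.0"})
    with urlopen(req, timeout=30) as resp:
        return json.load(resp)


def matching_links(log, wanted):
    """Return absolute links of list items whose first instance text equals `wanted`.

    Assumption: `log` is nested as body -> children[0] -> children[0] -> listItems,
    as in the question's own snippet.
    """
    items = log['body']['children'][0]['children'][0]['listItems']
    return [BASE + x['title']['commonlink']
            for x in items
            if x['instance'][0]['text'] == wanted]


# Usage (performs a network request, so it is left commented out):
# log = fetch_json(BASE + "/recruiting/upenn/careers-at-penn")
# print(matching_links(log, "my specific words"))
```

If the endpoint needs cookies or paging parameters, copy them from the request shown in the Network tab; the exact parameters are site-specific and not covered by this sketch.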

Python web scraping: websites from google search result

A newbie to Python here. I want to extract info from multiple websites (e.g. 100+) listed on a Google search results page. I just want to extract the key info, e.g. text in <h1>, <h2>, <b> or <li> HTML tags, but not entire <p> paragraphs.
I know how to gather a list of website URLs from that Google search, and I know how to scrape an individual website after looking at the page's HTML. I use Requests and BeautifulSoup for these tasks.
However, I want to know how I can extract key info from all these (100+!) websites without having to look at their HTML one by one. Is there a way to automatically find out which HTML tags a website uses to emphasize key messages? E.g. some websites may use <h1>, while others may use <b> or something else.
All I can think of is to come up with a list of possible "emphasis-type" HTML tags and then just use BeautifulSoup's find_all() to do a wide-scale extraction. But surely there must be an easier way?
It would seem that you must first learn how to write loops and functions. Every website is completely different, and scraping even a single website to extract useful information is daunting. I'm a newbie myself, but if I had to extract info from headers like you, this is what I would do (this is just concept code, but I hope you'll find it useful):
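import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

def getLinks(articleUrl):
    # Concept only: the host, class name, and regex are placeholders.
    html = urlopen('http://en.web.com{}'.format(articleUrl))
    bs = BeautifulSoup(html, 'html.parser')
    return bs.find('h1', {'class':'header'}).find_all('h1',
        header=re.compile('^(/web/)((?!:).)*$'))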
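The "list of emphasis tags" idea from the question can be sketched as follows. This is only an illustration under assumptions: the tag list is a guess at what sites commonly use, the parser is the standard library's html.parser (with BeautifulSoup the same idea would be soup.find_all(['h1', 'h2', 'b', 'li'])), and it expects emphasis tags to be explicitly closed.

```python
from html.parser import HTMLParser

# Assumed list of "emphasis" tags; extend it to suit the sites being scraped.
EMPHASIS_TAGS = {"h1", "h2", "h3", "b", "strong", "li"}


class EmphasisCollector(HTMLParser):
    """Collect (tag, text) pairs for outermost emphasis tags, skipping <p> text."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open emphasis tags
        self.buffer = []     # text seen since the outermost emphasis tag opened
        self.results = []    # collected (tag, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in EMPHASIS_TAGS:
            self.open_tags.append(tag)

    def handle_data(self, data):
        if self.open_tags:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if self.open_tags and tag == self.open_tags[-1]:
            self.open_tags.pop()
            if not self.open_tags:
                # Outermost emphasis tag closed: normalise whitespace and store.
                text = " ".join("".join(self.buffer).split())
                if text:
                    self.results.append((tag, text))
                self.buffer = []


def extract_emphasis(html):
    """Return a list of (tag, text) pairs for emphasised content in `html`."""
    parser = EmphasisCollector()
    parser.feed(html)
    return parser.results


page = "<h1>Title</h1><p>long paragraph</p><ul><li><b>Key</b> point</li></ul>"
print(extract_emphasis(page))
```

The same function can then be applied in a loop over the HTML of each URL gathered from the search results, so no page needs to be inspected by hand.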

Selenium project to Requests

I've developed a web scraper that extracts reviews from a particular shopping website. It's written in Python and the scraping is based on Selenium + BS4. But my client thinks it's TOO SLOW and wants it to use Requests. To scrape the reviews, I have to wait until the reviews show up (or click a review tab) and then page through all the reviews. I'm guessing the review div is loaded by an XHR/Ajax request, because the whole page doesn't reload when I click the next page. All the parsing is done with BeautifulSoup.
I'm leaving a URL so you can all go and check!
https://smartstore.naver.com/hoskus/products/4351439834?NaPm=ct%3Dkeuq83q8%7Cci%3Df5e8bd34633b9b48d81db83b289f1b2e0512d2f0%7Ctr%3Dslsl%7Csn%3D888653%7Chk%3D9822b0c3e9b322fa2d874575218c223ce2454a42
I've always thought Requests reads the HTML far faster than Selenium. But I don't know how to obtain the HTML when it's all hidden behind buttons. Does anybody have an idea, or something I can refer to?
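If the reviews really are loaded by a separate XHR request, one option is to call that request directly and page through it without a browser. The sketch below is hypothetical throughout: the endpoint URL, the "page" and "pageSize" parameter names, and the assumption that each page is a JSON list all have to be replaced with whatever the browser's Network tab shows for the real site.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def fetch_review_page(base_url, page, page_size=20):
    """Fetch one page of reviews from a hypothetical JSON endpoint.

    Assumption: the endpoint accepts `page` and `pageSize` query parameters
    and returns a JSON list of reviews.
    """
    query = urlencode({"page": page, "pageSize": page_size})
    with urlopen(f"{base_url}?{query}", timeout=30) as resp:
        return json.load(resp)


def collect_reviews(fetch_page, max_pages=100):
    """Call fetch_page(1), fetch_page(2), ... until an empty page comes back."""
    reviews = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)
        if not batch:
            break
        reviews.extend(batch)
    return reviews


# Usage (network access required; REVIEWS_URL is a placeholder):
# REVIEWS_URL = "https://example.com/reviews.json"
# all_reviews = collect_reviews(lambda p: fetch_review_page(REVIEWS_URL, p))
```

Passing the fetch function in as an argument keeps the paging loop independent of the HTTP library, so it works the same with Requests in place of urllib.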

Python webscraping how to get only the body html

Hey, I am trying to implement a program that gets URLs from the HTML of a website, but I only want the URLs from the body. Basically, I want to avoid ads and menus and only get links that are embedded in the actual article. Does anyone know a good way of isolating the body HTML from the rest of the HTML without hardcoding how the body is designated for each website?
It is a simple process to scrape only specific parts of the HTML. For the most part you can choose the elements you want from the page. Let's say you only want <div id="example">example</div>: you can tell your scraper to pick up only that div. Please check this example out.
https://realpython.com/beautiful-soup-web-scraper-python/

Web Scraping Javascript Using Python

I am used to using BeautifulSoup to scrape websites, but this website is different. When I call soup.prettify() I get back JavaScript code, lots of it. I want to scrape the data shown on the actual website (company name, telephone number, etc.). Is there a way of scraping scripts such as Main.js to retrieve the data that is displayed on the website?
Clear version:
Code is:
<script src="/docs/Main.js" type="text/javascript" language="javascript"></script>
This holds the text that is on the website. I would like to scrape this text, but it is populated using JS, not HTML (which is what I used BeautifulSoup for).
You're asking if you can scrape text generated at runtime by JavaScript. The answer is: sort of.
You'd need to run some kind of headless browser, like PhantomJS, to let the JavaScript execute and populate the page. You'd then feed the HTML that the headless browser generates to BeautifulSoup to parse it.
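The render-then-parse flow described above can be sketched as below. Since PhantomJS is no longer maintained, this sketch swaps in headless Chrome driven by Selenium, which is an assumption about your setup (Selenium 4 and Chrome installed). The helper names and the fixed wait are illustrative; an explicit wait on a known element is usually more reliable than sleeping.

```python
import time


def make_headless_chrome():
    """Create a headless Chrome driver (assumes Selenium 4 and Chrome are installed)."""
    from selenium import webdriver  # imported lazily so the rest runs without Selenium

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    return webdriver.Chrome(options=options)


def rendered_html(driver, url, wait_seconds=3):
    """Load `url`, wait for scripts to populate the page, and return the final HTML."""
    driver.get(url)
    time.sleep(wait_seconds)  # crude wait; tune or replace with an explicit wait
    return driver.page_source


# Usage (needs a real browser, so it is left commented out):
# driver = make_headless_chrome()
# try:
#     html = rendered_html(driver, "https://example.com")
# finally:
#     driver.quit()
# soup = BeautifulSoup(html, "html.parser")  # requires: from bs4 import BeautifulSoup
```

The parser only ever sees the string returned by rendered_html, so everything after that point is ordinary BeautifulSoup work.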
