I'm currently using Beautiful Soup to try to scrape a website for data, but the Python module only reads the page's source code. The information I need isn't in the source code, yet if I right-click the page in Chrome and choose Inspect Element, it is there.
I was wondering whether there is any way a Python module could scrape the elements of a web page rather than its source code.
In Beautiful Soup I've tried searching for the elements, but they just don't come up or appear because it's searching the source code. I'm also not sure why or how they don't appear there.
When the contents are loaded by JavaScript, you cannot get the data with Beautiful Soup alone. In this situation the Selenium library is typically used, since it drives a real browser and is the handier tool for extracting dynamically generated content.
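A minimal sketch of that approach, assuming Chrome and a matching chromedriver are installed (the URL and the class name are placeholders):

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()                       # drives a real browser, so JavaScript runs
driver.get('https://example.com/page')            # placeholder URL
html = driver.page_source                         # the HTML after JavaScript has modified the DOM
driver.quit()

soup = BeautifulSoup(html, 'html.parser')
print(soup.find('div', {'class': 'some-class'}))  # placeholder selector; the elements from Inspect Element are now findable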
I tried to scrape an HTML web page, https://streamelements.com/logna/leaderboard, but the HTML I can see in Inspect Element in Firefox is different from the HTML source code of the page.
Is it possible to scrape web pages like this, or is there a way to get the code you can see through Inspect Element?
The HTML seen in the inspect tool can differ from the original source code: the browser executes the page's JavaScript, which modifies the DOM after the initial HTML arrives. So when web scraping, you should consider the HTML as seen in the browser, not the original source code.
Hope this helps.
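One quick way to see this for yourself (a sketch, on the assumption that the leaderboard rows are injected by JavaScript): save the raw response and compare it with what Inspect Element shows.

import requests

url = 'https://streamelements.com/logna/leaderboard'
raw = requests.get(url).text                      # the HTML as delivered, before any JavaScript runs
with open('raw_source.html', 'w', encoding='utf-8') as f:
    f.write(raw)
# open raw_source.html next to the Inspect Element view;
# the leaderboard entries should be missing from the raw file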
Selenium can be used to navigate a web site (log in, get the HTML source of a page on the site), but Selenium itself has nothing that will find and extract data from that HTML by XPath: find_element_by_xpath() will find elements, but not text nodes outside of tags, so something else, like lxml, must be used. Selenium absolutely cannot be used for that; when you try, it throws an error.
I can't find any examples anywhere on the web of using Selenium to get the HTML source and then passing it to lxml to parse the HTML and extract data by XPath.
lxml examples are usually given in conjunction with the Python requests library, where the response is obtained as bytes (response.content). lxml works with that response.content, but as far as I can tell none of its functions accept the HTML as a string, and Selenium only returns the HTML as a string: self.driver.page_source.
So what can I do here?
I need to use lxml, because it provides XPath capability.
I cannot use Python's requests library to log in to the web site and navigate to a page; it just does not work with this site because of some complexities in how they designed things. Selenium is the only thing that works to log in, create a session, and pass the right cookies on a subsequent GET request.
So I need to use Selenium and page_source (a string), but I am not sure how to convert that to the exact bytes that lxml's functions require.
It's proving quite difficult to scrape with Python when the libraries don't work together: Selenium has no option to produce the HTML as bytes, and lxml apparently won't accept the data as a string.
Any and all help would be appreciated, but I don't believe this can be answered unless you have specifically run into this problem and have successfully used Selenium and lxml together.
Try something along these lines and see if it works for you; lxml.html.fromstring() accepts the string that page_source returns, no conversion to bytes needed:

import lxml.html

data = self.driver.page_source
doc = lxml.html.fromstring(data)
target = doc.xpath('some xpath')  # returns elements, attributes or text, depending on the expression
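For a fuller end-to-end sketch (the URL and the XPath expressions are placeholders): Selenium handles the login session and fetches the page, and lxml does the XPath work, including pulling text nodes that sit outside of tags, which find_element_by_xpath() cannot return.

from selenium import webdriver
import lxml.html

driver = webdriver.Firefox()
driver.get('https://example.com/after-login')     # placeholder: a page reached after logging in
doc = lxml.html.fromstring(driver.page_source)    # fromstring() accepts the page_source string directly
driver.quit()

# XPath can address text nodes directly
texts = doc.xpath('//div[@id="content"]/text()')  # placeholder expression
print(texts)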
I am a beginner in programming and I am trying to build a scraper. Right now I'm using the requests library and BeautifulSoup. I give the program a link and I can extract any information I want from that single web page. What I am trying to accomplish is this: I want to give the program a search-results page containing a list of clickable links, have it collect the links from those search results, and then scrape some information from each of the linked pages.
If anyone can give me some sort of guidance on how I could achieve this I would appreciate it greatly! Are there some other libraries I should be using? Is there some reading material you could refer me to, maybe a video?
You can put all the URL links in a list and then have your request-sending function loop through it. Use the requests or urllib package for this.
For the search-result logic, look for <a> tags with an href attribute, as in the sketch below.
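A minimal sketch of that two-step pattern (the search URL and the extraction at the end are placeholders):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

search_url = 'https://example.com/search?q=something'   # placeholder
soup = BeautifulSoup(requests.get(search_url).text, 'html.parser')

# step 1: collect the result links from the search page (urljoin handles relative hrefs)
links = [urljoin(search_url, a['href']) for a in soup.find_all('a', href=True)]

# step 2: visit each link and scrape something from it
for link in links:
    page = BeautifulSoup(requests.get(link).text, 'html.parser')
    print(page.title.string if page.title else link)    # placeholder extraction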
I am trying to scrape data from the morningstar website below:
http://financials.morningstar.com/ratios/r.html?t=IBM&region=USA&culture=en_US
I am currently trying just IBM, but I hope eventually to be able to type in the ticker of another company and do the same with it. My code so far is below:
import requests, bs4

url = 'http://financials.morningstar.com/ratios/r.html?t=IBM&region=USA&culture=en_US'
page = requests.get(url)
c = page.content
soup = bs4.BeautifulSoup(c, "html.parser")
summary = soup.find("div", {"class": "r_bodywrap"})
tables = summary.find_all('table')
print(tables[0])
The problem I am experiencing is that, unlike on a simpler web page I have scraped before, the program can't seem to locate any tables, even though I can see them in the HTML for the page in the browser.
In researching this problem, the closest Stack Overflow question I found is below:
Python webscraping - NoneObeject Failure - broken HTML?
In that one they explained that Morningstar's tables are dynamically loaded, used some JSON code I am unfamiliar with, and somehow generated a different web link which managed to scrape the data, but I don't understand where that link came from.
It's a real problem scraping some modern web pages, particularly pages generated by single-page applications, where the content is built up by AJAX calls and DOM modification rather than delivered as ready-to-go HTML in a single server response.
The best way I have found to access such content is to use the Selenium web testing environment to have a browser load the page under the control of my program, then extract the page contents from Selenium for scraping. There are other environments that will execute the scripts and modify the DOM appropriately, but I haven't used any of them.
It's not as difficult as it sounds, but it will take you a little jiggering around to get there.
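As a rough sketch of that approach for the Morningstar page above (the wait condition is an assumption about how the page signals that it has finished loading):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import bs4

driver = webdriver.Chrome()
driver.get('http://financials.morningstar.com/ratios/r.html?t=IBM&region=USA&culture=en_US')

# wait until the ratios wrapper div has been added by the page's scripts
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'r_bodywrap')))

soup = bs4.BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

summary = soup.find('div', {'class': 'r_bodywrap'})
print(summary.find_all('table')[0])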
Web scraping can be greatly simplified when the site offers an API, be it officially supported or just an unofficial hack. Even the hack is better than trying to fiddle with the HTML which can change every day.
So a search for morningstar api might be fruitful. And, in fact, some friendly Gister has already worked this out for you.
Should that search come up empty, a usually fruitful approach is to investigate which AJAX calls the page makes to retrieve its data, and then issue those calls directly. You can see them in the browser's debugger, in the "Network" tab or similar, where each request can be inspected in detail in a very friendly UI.
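Once you have spotted the request in the Network tab, replaying it is usually straightforward; a sketch of the pattern (the URL below is a placeholder for whatever request you copy out of the debugger, not a documented Morningstar endpoint):

import requests

# paste the exact request URL (and any headers it needs) copied from the Network tab
ajax_url = 'http://financials.morningstar.com/...'  # placeholder
resp = requests.get(ajax_url)
print(resp.status_code)
print(resp.text[:500])  # often JSON or a small HTML fragment, much easier to parse than the full page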
I've found scraping dynamic sites to be a lot easier with JavaScript than with Python + Selenium. There is a great module for Node.js/PhantomJS: ScraperJS. It is very easy to use: it injects jQuery into the scraped page, and you can extract data with jQuery selectors.
I am trying to crawl this link using Python's BeautifulSoup and urllib2 libraries. One problem I am running into is that the soup object does not match the web page's HTML shown in Google Chrome's Developer Tools. I checked multiple times, and I am certain that I am passing the correct address. The reason I know they are different is that I printed the entire soup object in Sublime Text 2 and compared it against what is shown in Chrome's Developer Tools. I also searched for very specific tags in the soup object. After debugging for hours, I am out of ideas. Does anyone know why this is happening? Is there some sort of redirection going on?
JavaScript runs in the web site and changes the page's DOM. Any URL library (such as urllib2) only downloads the HTML and does not execute included or linked JavaScript. That's why you see a difference.
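As for the redirection question, urllib2 makes that easy to check; a small sketch (the URL is a placeholder for the link you are crawling):

import urllib2

url = 'http://example.com/page'  # placeholder for the link you are crawling
resp = urllib2.urlopen(url)
print(resp.geturl())             # the final URL after any redirects
html = resp.read()               # the raw HTML; matches "View Source", not the live DOM in Developer Tools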