Parsing a table on a webpage generated by a script using Python

I am trying to scrape the data contained in the table on https://www.bop.gov/coronavirus/. When you first visit the page, the table is hidden behind a link (https://www.bop.gov/coronavirus/#) that leads to the same page but expands the hidden table. However, I cannot find this link within the webpage's source code, nor can I use Selenium to expand the table and scrape its data. How can I access the data in this table using Python?

The endpoint from which the data is loaded on the page is available under the network tab of the developer tools. The data you need is loaded from
https://www.bop.gov/coronavirus/json/final.json
You might also want to take a look at
https://www.bop.gov/coronavirus/data/locations.json
as the first link only contains the short codes for the names.
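As a minimal sketch (the exact JSON field names are an assumption; inspect the real payloads in your browser's network tab first), you could pull both feeds and join them in Python:

import requests

# Both endpoints return JSON; the structure is not guaranteed,
# so print the payloads and adjust the key names to what you actually see.
final = requests.get("https://www.bop.gov/coronavirus/json/final.json").json()
locations = requests.get("https://www.bop.gov/coronavirus/data/locations.json").json()

print(final)      # per-facility counts keyed by short code (verify the keys)
print(locations)  # mapping of short codes to full facility names (verify the keys)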

The table data is readily available under the div with id="totals_breakdown".
You can directly call the page_source and parse the data for that element with BeautifulSoup without needing to "show" the element.
If you MUST show the element for some reason, you simply have to remove the closed class from the div with id="totals_breakdown".
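A rough sketch of both approaches, assuming the id and class names are as described above:

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://www.bop.gov/coronavirus/")

# Parse the already-present element straight from page_source.
soup = BeautifulSoup(driver.page_source, "html.parser")
totals = soup.find("div", id="totals_breakdown")
if totals is not None:
    print(totals.get_text(" ", strip=True))

# Only if you really need the element visible on screen:
driver.execute_script(
    "document.getElementById('totals_breakdown').classList.remove('closed');")

driver.quit()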

Related

Are there any selenium locators present which can scrape any content of a webpage?

Currently I use Python with Selenium for scraping. Selenium offers many ways to locate data, and I have been using CSS selectors.
But then I realised that only tag names are guaranteed to be present on every website.
For example, not every website uses classes or IDs. Take Wikipedia: it mostly uses plain tags,
like <h1> or <a>, without any class or ID on them.
That is the limitation of scraping by tag name: it selects every element under that tag.
For example, if I want to scrape table contents that sit under a <p> tag, it also scrapes all the descriptions I don't need.
My question is: is it possible to scrape only the required elements under a tag, without copying every other element under that tag?
For instance, if I want to scrape content from, say, Amazon, it should select only the product names in <h1> tags, not every heading that happens to be an <h1>.
If you know of any other method/locator, even something other than tag names, please tell me, as long as it is present on every website or at least most websites.
Any help would be appreciated 😊...

Webscraping: Table not included in BeautifulSoup Page

I am trying to scrape a table of company info from the table on this page: https://tools.ceres.org/resources/tools/sec-sustainability-disclosure/
I can see the table contents when using Chrome's dev-tools element inspector, but when I request the page in my script, the contents of the table are gone: the table element is there, just with no content.
Any idea how I can get that sweet, sweet content?
Thanks
Code is below:
import requests
from bs4 import BeautifulSoup
response = requests.get("https://tools.ceres.org/resources/tools/sec-sustainability-disclosure/")
page = BeautifulSoup(response.text, "html.parser")
page
You can find the API in the network traffic tab: it's calling
https://tools.ceres.org/resources/tools/sec-sustainability-disclosure/##api-disclosure?isabstract=0&companyName=&ticker=&year=2018&analysis=1&index=&sic=&keywords=
and you should be able to reconstruct the table from the resulting JSON. I haven't played around with all the parameters, but it seems like only year affects the resulting data set, i.e.
https://tools.ceres.org/resources/tools/sec-sustainability-disclosure/##api-disclosure?isabstract=0&year=2018&analysis=1
should give you the same result as the query above.
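A hedged sketch of that approach (it assumes the endpoint really returns JSON in a shape pandas can flatten; verify the actual response in the network tab):

import requests
import pandas as pd

url = ("https://tools.ceres.org/resources/tools/sec-sustainability-disclosure/"
       "##api-disclosure?isabstract=0&year=2018&analysis=1")
response = requests.get(url)
data = response.json()          # assumes the endpoint returns JSON as described
df = pd.json_normalize(data)    # column names depend on the actual payload
print(df.head())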
Based on the network traffic shown in the dev tools, the content isn't directly in the HTML; it gets loaded dynamically by the ApiService.js script. My suggestion would be to use Selenium to extract the content once the page has fully loaded (for example, waiting until the loading element has disappeared).
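For example (a sketch only; the .loading selector for the spinner is an assumption you would replace with the real one):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://tools.ceres.org/resources/tools/sec-sustainability-disclosure/")

# Wait until the (assumed) loading indicator disappears, then grab the table.
WebDriverWait(driver, 30).until(
    EC.invisibility_of_element_located((By.CSS_SELECTOR, ".loading")))

soup = BeautifulSoup(driver.page_source, "html.parser")
table = soup.find("table")
print(table.get_text(" ", strip=True) if table else "table not found")
driver.quit()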

How to extract google news with specific key word using scrapy?

I am new to scrapy and am trying to extract Google News results from the link below:
https://www.google.co.in/search?q=cholera+news&safe=strict&source=lnms&tbm=nws&sa=X&ved=0ahUKEwik0KLV-JfYAhWLpY8KHVpaAL0Q_AUICigB&biw=1863&bih=966
"cholera" key word was provided that shows small blocks of various news associated with cholera key world further I try this with scrapy to extract the each block that contents individual news.
fetch("https://www.google.co.in/search?q=cholera+news&safe=strict&source=lnms&tbm=nws&sa=X&ved=0ahUKEwik0KLV-JfYAhWLpY8KHVpaAL0Q_AUICigB&biw=1863&bih=966")
response.css(".ts._JGs._KHs._oGs._KGs._jHs::text").extract()
where .ts._JGs._KHs._oGs._KGs._jHs::text corresponds to the div class="ts _JGs _KHs _oGs _KGs _jHs" of each news block,
but it returns nothing.
After struggling, I found a way to scrape the desired data with a very simple trick:
fetch("https://www.google.co.in/search?q=cholera+news&safe=strict&source=lnms&tbm=nws&sa=X&ved=0ahUKEwik0KLV-JfYAhWLpY8KHVpaAL0Q_AUICigB&biw=1863&bih=966")
and the CSS selector for class="g" can be used to extract the desired blocks like this:
response.css(".g").extract()
which returns a list of all the individual news blocks; these can then be accessed by list index, like this:
response.css(".g").extract()[0]
or
response.css(".g").extract()[1]
In the scrapy shell, use view(response) to see in a web browser what fetch() actually retrieved.
Google uses JavaScript to display data, but it can also serve a page which doesn't use JavaScript. A page without JavaScript usually has different tags and classes, though.
You can also turn off JavaScript in your browser and then open Google to see those tags.
Try this:
response.css('#search td ::text').extract()

How to scrape a value from a page that loads dynamically?

The homepage of the website I'm trying to scrape displays four tabs, one of which reads "[Number] Available Jobs". I'm interested in scraping the [Number] value. When I inspect the page in Chrome, I can see the value enclosed within a <span> tag.
However, there is nothing enclosed in that <span> tag when I view the page source directly. I was planning on using the Python requests module to make an HTTP GET request and then use regex to capture the value from the returned content. This is obviously not possible if the content doesn't contain the number I need.
My questions are:
1. What is happening here? How can a value be dynamically loaded into a page, displayed, and then not appear within the HTML source?
2. If the value doesn't appear in the page source, what can I do to reach it?
If the content doesn't appear in the page source, then it is probably generated using JavaScript. For example, the site might have a REST API that lists jobs, and the JavaScript code could request the jobs from that API, create the node in the DOM, and attach it to show the available jobs. That's just one possibility.
One way to scrape this information is to figure out how that JavaScript works and make your Python scraper do the same thing (for example, if there is a simple REST API it is using, you just need to make a request to that same URL). Often that is not so easy, so another alternative is to do your scraping with a JavaScript-capable browser like Selenium.
One final thing I want to mention: regular expressions are a fragile way to parse HTML; you should generally prefer a library like BeautifulSoup.
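As an illustrative sketch only (the URL and the span selector are placeholders, since the original question does not name the site):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://example.com/")  # placeholder URL

# Wait for the JavaScript-rendered counter to appear; the selector is a placeholder.
span = WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "span.available-jobs")))
print(span.text)
driver.quit()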
1. A value can be loaded dynamically with Ajax. Ajax loads asynchronously, meaning the rest of the site does not wait for it to render; that's why, when you fetch the DOM, the elements loaded with Ajax do not appear in it.
2. For scraping dynamic content you should use Selenium; there are tutorials for this.
For data that loads dynamically, you should look for an XHR request in the network tab; if you can make that data work for you, then voilà!
You can also use PhantomJS; it's a headless browser and it captures the HTML of the page with the dynamically loaded content.

Python - How to scrape Tr/Td table data using 'requests & BeautifulSoup'

I'm new to programming. I'm trying out my first Web Crawler program that will help me with my job. I'm trying to build a program that will scrape tr/td table data from a web page, but am having difficulties succeeding. Here is what I have so far:
import requests
from bs4 import BeautifulSoup
def start(url):
    source_code = requests.get(url).text
    soup = BeautifulSoup(source_code)
    for table_data in soup.find_all('td', {'class': 'sorting_1'}):
        print(table_data)

start('http://www.datatables.net/')
My goal is to print out each line and then export it to an excel file.
Thank you,
-Cire
My recommendation is that if you are new to Python, you play with things in an IPython notebook (interactive prompt) to get them working first and to get a feel for things before you try writing a script or a function. On the plus side, all variables stick around and it is much easier to see what is going on.
You can see immediately that the find_all call is not finding anything: an empty list [] is being returned. With IPython you can easily try other variants of a call on a previously defined variable, for example soup.find_all('td').
Looking at the source of http://www.datatables.net, I do not see any instances of the text sorting_1, so I wouldn't expect a search for all table cells of that class to return anything.
Perhaps that class appeared on a different URL associated with the DataTables website, in which case you would need to use that URL in your code. It's also possible that that class only appears after certain JavaScript has been run client-side (i.e. after certain actions with the sample tables, perhaps), and not on the initially loaded page.
I'd recommend starting with tags you know are on the initial page (seen by looking at the page source in your browser).
For example, currently, I can see a div with class="content". So the find_all code could be changed to the following:
for table_data in soup.find_all('div', {'class': 'content'}):
    print(table_data)
And that should find something.
Response to comments from OP:
The precise reason why you're not finding that tag/class pairing in this case is that DataTables renders the table client-side via JavaScript, generally after the DOM has finished loading (although it depends on the page and where the DataTables init code is placed). That means the HTML associated with the base URL does not contain this content. You can see this if you curl the base URL and look at the output.
However when loading it in a browser, once the JavaScript for DataTables fires, the table is rendered and the DOM is dynamically modified to add the table, including cells with the class for which you're looking.
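If you do want to go through the rendered page rather than the underlying data, a Selenium sketch like the following would wait for DataTables to finish rendering (assuming the page you target actually contains an initialised example table with sorting_1 cells):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("http://www.datatables.net/")  # or the specific example page you need

# sorting_1 is added by DataTables only after its JavaScript runs client-side.
cells = WebDriverWait(driver, 15).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, "td.sorting_1")))
for cell in cells:
    print(cell.text)
driver.quit()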
