I am trying to scrape a table of company info from the table on this page: https://tools.ceres.org/resources/tools/sec-sustainability-disclosure/
I can see the table contents when using chrome's dev tool element inspector, but when I request the page in my script, the contents of the table are gone... just with no content.
Any idea how I can get that sweet, sweet content?
Thanks
Code is below:
import requests
from bs4 import BeautifulSoup
response = requests.get("https://tools.ceres.org/resources/tools/sec-sustainability-disclosure/")
page = BeautifulSoup(response.text, "html.parser")
page
You can find the API in the network traffic tab: it's calling
https://tools.ceres.org/resources/tools/sec-sustainability-disclosure/##api-disclosure?isabstract=0&companyName=&ticker=&year=2018&analysis=1&index=&sic=&keywords=
and you should be able to reconstruct the table from the resulting JSON. I haven't played around with all the parameters but it's seems like only year affects the resulting data set, i.e.
https://tools.ceres.org/resources/tools/sec-sustainability-disclosure/##api-disclosure?isabstract=0&year=2018&analysis=1
should give you the same result as the query above.
Based on the Network traffic using the dev tool, the content isn't directly on the html, but gets called dynamically from ApiService.js script. My suggestion would be to use Selenium to extract the content once the page has fully loaded (for example until the loading element has disappeared).
Related
Simple question. Why is it that when I inspect element I see the data I want embedded within the JS tags - but when I go directly to Page Source I do not see it at all?
As an example, basically I am looking to get the description of the eBay listing. In this case, the text in the body of the listing that reads "BRAND NEW Factory Sealed
Playstation 5 (PS5) Bluray Disc System Console [...]
We usually ship within 24 hours of purchase."
Sample code below. If I search for the text within the printout, I cannot find it.
import requests
from bs4 import BeautifulSoup
url = 'www.ebay.com/itm/272037717929'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
print(soup.prettify())
it's probably because ebay is using javascript to load content into the page. A workout this problem would be using something playwright or selenium. I personally prefer the first option. It uses a chromium browser to actually get the page contents, hence loads javascript in the proccess
I am trying to make a coronavirus tracker using beautifulsoup just for some practice.
my code is,
import requests
from bs4 import BeautifulSoup
page=requests.get("https://sample.com")
soup=BeautifulSoup(page.content,'html.parser')
table=soup.find("div",class_="ZDcxi")
print(table)
In the output its showing none, but the div tag with the class ZDcxi do have content.
please help
The data, which you see in the browser, and includes the target div, is dynamic content, generated by scripts included with the page and run in the browser. If you just search for the class name in page.content, you will find it is not there.
What many people do is use selenium to open desired pages through Chrome (or another web browser), and then, after the page finishes loading and generating dynamic content, use BeautifulSoup to harvest the content from the browser, and continue processing from there.
Find out more at Requests vs Selenium Python, and also when you search selenium vs requests/
from bs4 import BeautifulSoup
import requests
url = "https://www.deribit.com/main#/options?tab=all"
content = requests.get(url).content
soup = BeautifulSoup(content,'html.parser')
I'm trying to grab all the data on the bottom of the page where it says "Recent Trades CALLS" and "Recent Trades PUTS". I have tried variations of:
soup.find_all(div', {'class': 'row'})
soup.find_all('tbody')
But to no avail. To clarify I would like to grab the entire tables' worth of data including all the columns like (Assets, Price, etc).
That is dynamic data, is is not contained in the page you request but gets loaded by javascript after initial page load... you see 3 blue dots working in the middle of the screen while actually getting the data, you have 2 options for this:
Use Chrom/Firefox dev tools to snoop on the network pane for the calls to get the data you want and try to emulate those calls, cookies, headers, parameters, all.
Use an actual browser that will load the full page before scraping it, for this you can use selenium
Sorry if this is a silly question.
I am trying to use Beautifulsoup and urllib2 in python to look at a url and extract all divs with a particular class. However, the result is always empty even though I can see the divs when I "inspect element" in chrome's developer tools.
I looked at the page source and those divs were not there which means they were inserted by a script. So my question is how can i look for those divs (using their class name) using Beautifulsoup? I want to eventually read and follow hrefs under those divs.
Thanks.
[Edit]
I am currently looking at the H&M website: http://www.hm.com/sg/products/ladies and I am interested to get all the divs with class 'product-list-item'
Try using selenium to run the javascript
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("http://www.python.org")
html = driver.page_source
check this link enter link description here
you can get all info by change the url, this link can be found in chrome dev tools > Network
The reason why you got nothing from that specific url is simply because, the info you need is not there.
So first let me explain a little bit about how that page is loaded in a browser: when you request for that page(http://www.hm.com/sg/products/ladies), the literal content will be returned in the very first phase(which is what you got from your urllib2 request), then the browser starts to read/parse the content, basically it tells the browser where to find all information it needs to render the whole page(e.g. CSS to control layout, additional javascript/urls/pages to populate certain area etc.), and the browser does all that behind the scene. When you "inspect element" in chrome, the page is already fully loaded, and those info you want is not in original url, so you need to find out which url is used to populate those area and go after that specific url instead.
So now we need to find out what happens behind the scene, and a tool is needed to capture all traffic when that page loads(I would recommend fiddler).
As you can see, lots of things happen when you open that page in a browser!(and that's only part of the whole page-loading process) So by educated guess, those info you need should be in one of those three "api.hm.com" requests, and the best part is they are alread JSON formatted, which means you might not even bother with BeautifulSoup, the built-in json module could do the job!
OK, now what? Use urllib2 to simulate those requests and get what you want.
P.S. requests is a great tool for this kind of job, you can get it here.
Try This one :
from bs4 import BeautifulSoup
import urllib2
page = urllib2.urlopen("http://www.hm.com/sg/products/ladies")
soup = BeautifulSoup(page.read(),'lxml')
scrapdiv = open('scrapdiv.txt','w')
product_lists = soup.findAll("div",{"class":"o-product-list"})
print product_lists
for product_list in product_lists:
print product_list
scrapdiv.write(str(product_list))
scrapdiv.write("\n\n")
scrapdiv.close()
I'm new to programming. I'm trying out my first Web Crawler program that will help me with my job. I'm trying to build a program that will scrape tr/td table data from a web page, but am having difficulties succeeding. Here is what I have so far:
import requests
from bs4 import BeautifulSoup
def start(url):
source_code = requests.get(url).text
soup = BeautifulSoup(source_code)
for table_data in soup.find_all('td', {'class': 'sorting_1'}):
print(table_data)
start('http://www.datatables.net/')
My goal is to print out each line and then export it to an excel file.
Thank you,
-Cire
My recommendation is that if you are new to Python, play with things via the iPython notebook (interactive prompt) to get things working first and to get a feel for things before you try writing a script or a function. On the plus side all variables will stick around and it is much easier to see what is going on.
From the screen shot here, you can see immediately that the find_all function is not finding anything. An empty lists [] is being returned. By using ipython you can easily try other variants of a function on a previously defined variable. For example, the soup.find_all('td').
Looking at the source of http://www.datatables.net, I do not see any instances of the text sorting_1, so I wouldn't expect a search for all table cells of that class to return anything.
Perhaps that class appeared on a different URL associated with the DataTables website, in which case you would need to use that URL in your code. It's also possible that that class only appears after certain JavaScript has been run client-side (i.e. after certain actions with the sample tables, perhaps), and not on the initially loaded page.
I'd recommend starting with tags you know are on the initial page (seen by looking at the page source in your browser).
For example, currently, I can see a div with class="content". So the find_all code could be changed to the following:
for table_data in soup.find_all('div', {'class': 'content'}):
print(table_data)
And that should find something.
Response to comments from OP:
The precise reason why you're not finding that tag/class pairing in this case is that DataTables renders the table client-side via JavaScript, generally after the DOM has finished loading (although it depends on the page and where the DataTables init code is placed). That means the HTML associated with the base URL does not contain this content. You can see this if you curl the base URL and look at the output.
However when loading it in a browser, once the JavaScript for DataTables fires, the table is rendered and the DOM is dynamically modified to add the table, including cells with the class for which you're looking.