Cannot find table content (hidden table) when scraping a web page with Scrapy - Python

I am trying to scrape the following URL (http://cmegroup.com/clearing/operations-and-deliveries/accepted-trade-types/block-data.html/#contractTypes=FUT&exchanges=XNYM&assetClassId=0). The table content is what I'm interested in, but the table looks like it is hidden somewhere:
If I right-click the table and choose Inspect, I can see the element (marked ==$0) in dev tools.
But in the scrapy shell, response.xpath('//*[@table]') returns nothing, which means I can't scrape the content this way.
Please help with this issue, thanks.
UPDATE: The final solution was to use Selenium (a great tool) for this scraping task. Selenium is especially useful when page content such as tables is rendered by JavaScript; there are plenty of Selenium tutorials to be found in the community.
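Here is one example of that approach, as a minimal sketch (the wait condition and the "table tr" selector are assumptions; adjust them to the live page's markup):

# Minimal Selenium sketch: load the page, let the JavaScript render the
# table, then read the rows. The CSS selector is an assumption.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = ("http://cmegroup.com/clearing/operations-and-deliveries/"
       "accepted-trade-types/block-data.html"
       "/#contractTypes=FUT&exchanges=XNYM&assetClassId=0")

driver = webdriver.Chrome()
try:
    driver.get(url)
    # Wait until at least one table row has been rendered by JavaScript.
    WebDriverWait(driver, 20).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "table tr"))
    )
    for row in driver.find_elements(By.CSS_SELECTOR, "table tr"):
        print(row.text)
finally:
    driver.quit()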

The reason the table is empty is that you are scraping the wrong URL. The URL that actually contains the table data is:
http://www.cmegroup.com/CmeWS/mvc/xsltTransformer.do?xlstDoc=/XSLT/md/blocks-records.xsl&url=/da/BlockTradeQuotes/V1/Block/BlockTrades?exchange=XCBT,XCME,XCEC,DUMX,XNYM&foi=FUT,OPT,SPD&assetClassId=0&tradeDate=05172018&sortCol=time&sortBy=desc
The "05172018" text on url above looks like a date filter with this format: MMDDYYYY.

Related

Webscraping: Table not included in BeautifulSoup Page

I am trying to scrape a table of company info from the table on this page: https://tools.ceres.org/resources/tools/sec-sustainability-disclosure/
I can see the table contents when using Chrome's dev tools element inspector, but when I request the page in my script, the table element is there with no content inside it.
Any idea how I can get that sweet, sweet content?
Thanks
Code is below:
import requests
from bs4 import BeautifulSoup

response = requests.get("https://tools.ceres.org/resources/tools/sec-sustainability-disclosure/")
page = BeautifulSoup(response.text, "html.parser")
page  # displaying the parsed HTML shows the table element is empty
You can find the API in the network traffic tab: it's calling
https://tools.ceres.org/resources/tools/sec-sustainability-disclosure/##api-disclosure?isabstract=0&companyName=&ticker=&year=2018&analysis=1&index=&sic=&keywords=
and you should be able to reconstruct the table from the resulting JSON. I haven't played around with all the parameters, but it seems like only year affects the resulting data set, i.e.
https://tools.ceres.org/resources/tools/sec-sustainability-disclosure/##api-disclosure?isabstract=0&year=2018&analysis=1
should give you the same result as the query above.
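A hedged sketch of that approach (copy the exact request URL from your own Network tab, since the path above may not be verbatim, and inspect the JSON before relying on any particular keys):

# Call the API endpoint found in the Network tab and work with the JSON
# directly. The URL is taken from the answer above; the response
# structure is an assumption, so inspect it first.
import requests

url = ("https://tools.ceres.org/resources/tools/sec-sustainability-disclosure/"
       "##api-disclosure?isabstract=0&year=2018&analysis=1")
response = requests.get(url)
data = response.json()
print(data)  # inspect the structure, then rebuild the table rows from it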
Based on the network traffic shown in the dev tools, the content isn't directly in the HTML; it gets loaded dynamically by the ApiService.js script. My suggestion would be to use Selenium to extract the content once the page has fully loaded (for example, by waiting until the loading element has disappeared).
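A minimal sketch of that Selenium approach (the ".loading" selector and the timeout are assumptions; check the loading element's actual class name in dev tools):

# Wait for the loading indicator to disappear, then hand the fully
# rendered HTML to BeautifulSoup.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
try:
    driver.get("https://tools.ceres.org/resources/tools/sec-sustainability-disclosure/")
    WebDriverWait(driver, 30).until(
        EC.invisibility_of_element_located((By.CSS_SELECTOR, ".loading"))
    )
    page = BeautifulSoup(driver.page_source, "html.parser")
    print(page.find("table"))  # the table is now populated
finally:
    driver.quit()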

How do I obtain data with Scrapy from a web page whose source doesn't contain the code I want to scrape?

I'm trying to get the names of the users and the contents of the comments that exist on this page (both are visible in the rendered page).
When I test the extraction with the Chrome plugin XPath Helper, I get the user names with the expression:
//*[@id="livefyre"]/div/div/div/div/article/div/header/a/span
and the comments with:
//*[@id="livefyre"]/div/div/div/div/article/div/section/div/p
But when I run the test in the scrapy console with the query:
response.xpath('//*[@id="livefyre"]/div/div/div/div/article/div/section/div/p').extract()
I get [].
I've also tried:
response.xpath('//*[@id="livefyre"]/div/div/div/div/article/div/section/div/p/text()').extract()
The same thing happens with my code.
Checking the page source, I see that none of those comments exist in the HTML.
When I inspect the page in dev tools I can see the comment text, but when I look at the page's HTML source I see nothing.
Where am I making a mistake?
Thanks for the help.
As you stated, there aren't any comments in the page's code, which means the website is rendered through JavaScript. There are two ways you can scrape this kind of website:
First, use scrapy-splash to render the JavaScript.
Second, find the API/network call that brings in the comments and mock that request in Scrapy to get your data (a sketch follows).
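A sketch of the second option (the endpoint URL and the JSON field names are placeholders; copy the real request from the browser's Network tab):

# Call the comments API directly from Scrapy instead of parsing the page.
import json
import scrapy

class CommentsSpider(scrapy.Spider):
    name = "comments"

    def start_requests(self):
        # Hypothetical endpoint: replace it with the request you find
        # in the Network tab.
        yield scrapy.Request(
            "https://example.com/livefyre/comments.json",
            callback=self.parse_comments,
        )

    def parse_comments(self, response):
        data = json.loads(response.text)
        for comment in data.get("comments", []):  # placeholder key
            yield {
                "user": comment.get("author"),  # placeholder key
                "text": comment.get("body"),    # placeholder key
            }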

Python - Scrapy ecommerce website

I'm trying to scrape the price of this product
http://www.asos.com/au/fila/fila-vintage-plus-ringer-t-shirt-with-small-logo-in-green/prd/9065343?clr=green&SearchQuery=&cid=7616&gridcolumn=2&gridrow=1&gridsize=4&pge=1&pgesize=72&totalstyles=4699
with the following code, but it returns an empty array:
response.xpath('//*[@id="product-price"]/div/span[2]/text()').extract()
Any help is appreciated, Thanks.
Because the site is dynamic: using the view(response) command in the scrapy shell shows the page exactly as Scrapy received it, and as you can see, the price info doesn't come out.
Solutions:
1. Splash.
2. Selenium + PhantomJS.
It might also help to check this answer: Empty List From Scrapy When Using Xpath to Extract Values
The price is later added by the browser which renders the page using javascript code found in the html. If you disable javascript in your browser, you would notice that the page would look a bit different. Also, take a look at the page source, usually that's unaltered, to see that the tag you're looking for doesn't exist (yet).
Scrapy doesn't execute any javascript code. It receives the plain html and that's what you have to work with.
If you want to extract data from pages which look the same as in the browser, I recommend using a headless browser like Splash (if you're already using Scrapy): https://github.com/scrapinghub/splash
You can programmatically tell it to download your page, render it, and select the data points you're interested in.
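A minimal scrapy-splash sketch (it assumes a Splash instance is running and that SPLASH_URL and the scrapy-splash middlewares are configured in settings.py):

# Let Splash execute the page's JavaScript before the XPath runs.
import scrapy
from scrapy_splash import SplashRequest

class PriceSpider(scrapy.Spider):
    name = "price"

    def start_requests(self):
        url = ("http://www.asos.com/au/fila/fila-vintage-plus-ringer-"
               "t-shirt-with-small-logo-in-green/prd/9065343")
        yield SplashRequest(url, self.parse, args={"wait": 2})

    def parse(self, response):
        # The same XPath from the question now matches, because the
        # price element exists in the rendered HTML.
        yield {
            "price": response.xpath(
                '//*[@id="product-price"]/div/span[2]/text()'
            ).get(),
        }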
The other way is to check for the request made to the Asos API which asks for the product data. In your case, for this product:
http://www.asos.com/api/product/catalogue/v2/stockprice?productIds=9065343&currency=AUD&keyStoreDataversion=0ggz8b-4.1&store=AU
I got this url by taking a look at all the XMLHttpRequest (XHR) requests sent in the Network tab found in Developers Tools (on Google Chrome).
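A sketch of that approach with requests (the JSON structure isn't documented, so print it first and drill down from there):

# Call the stock price API directly and inspect the JSON it returns.
import requests

url = ("http://www.asos.com/api/product/catalogue/v2/stockprice"
       "?productIds=9065343&currency=AUD"
       "&keyStoreDataversion=0ggz8b-4.1&store=AU")
response = requests.get(url)
print(response.json())  # inspect, then drill into the price field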
You can try to find the JSON inside the HTML (using a regular expression) and parse it:
import json

json_string = response.xpath('//script[contains(., "function (view) {")]/text()').re_first(r'view\(\'([^\']+)')
data = json.loads(json_string)
price = data["price"]["current"]

Is it possible to scrape a "dynamical webpage" with beautifulsoup?

I am currently beginning to use BeautifulSoup to scrape websites. I think I have got the basics down, even though I lack theoretical knowledge about web pages; I will do my best to formulate my question.
What I mean by a "dynamical webpage" is the following: a site whose HTML changes based on user action, in my case collapsible tables.
I want to obtain the data inside some div tag, but when you load the page, the data seems unavailable in the HTML code. When you click on the table it expands, and the class of this div changes from something like "something blabla collapsible" to "something blabla collapsible active", which I can scrape with my knowledge.
Can I get this data using BeautifulSoup? If I can't, I thought of using something like Selenium to click on all the tables and then download the HTML, which I could then scrape. Is there an easier way?
Thank you very much.
It depends. If the data is already loaded when the page loads, then the data is available to scrape, it's just in a different element, or being hidden. If the click event triggers loading of the data in some way, then no, you will need Selenium or another headless browser to automate this.
Beautiful soup is only an HTML parser, so whatever data you get by requesting the page is the only data that beautiful soup can access.
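If you do go the Selenium route, a sketch along the lines you described (the URL is a placeholder and the class names are taken from your description; adjust both to the real page):

# Click every collapsible header so the hidden rows are added to the DOM,
# then parse the final HTML with BeautifulSoup.
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
try:
    driver.get("https://example.com/page-with-collapsible-tables")  # placeholder
    for header in driver.find_elements(By.CSS_SELECTOR, ".collapsible"):
        header.click()
        time.sleep(0.5)  # give each table a moment to expand
    soup = BeautifulSoup(driver.page_source, "html.parser")
    for div in soup.select("div.collapsible.active"):
        print(div.get_text(strip=True))
finally:
    driver.quit()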

Python - How to scrape Tr/Td table data using 'requests & BeautifulSoup'

I'm new to programming. I'm trying out my first web crawler program, which will help me with my job. I'm trying to build a program that scrapes tr/td table data from a web page, but I'm having difficulty succeeding. Here is what I have so far:
import requests
from bs4 import BeautifulSoup

def start(url):
    source_code = requests.get(url).text
    # An explicit parser avoids BeautifulSoup's "no parser specified" warning.
    soup = BeautifulSoup(source_code, "html.parser")
    for table_data in soup.find_all('td', {'class': 'sorting_1'}):
        print(table_data)

start('http://www.datatables.net/')
My goal is to print out each line and then export it to an excel file.
Thank you,
-Cire
My recommendation is that if you are new to Python, play with things via the IPython notebook (interactive prompt) to get things working first and to get a feel for things before you try writing a script or a function. On the plus side, all variables stick around and it is much easier to see what is going on.
From the screenshot here, you can see immediately that the find_all function is not finding anything: an empty list [] is being returned. With IPython you can easily try other variants of a function on a previously defined variable, for example soup.find_all('td').
Looking at the source of http://www.datatables.net, I do not see any instances of the text sorting_1, so I wouldn't expect a search for all table cells of that class to return anything.
Perhaps that class appeared on a different URL associated with the DataTables website, in which case you would need to use that URL in your code. It's also possible that that class only appears after certain JavaScript has been run client-side (i.e. after certain actions with the sample tables, perhaps), and not on the initially loaded page.
I'd recommend starting with tags you know are on the initial page (seen by looking at the page source in your browser).
For example, currently, I can see a div with class="content". So the find_all code could be changed to the following:
for table_data in soup.find_all('div', {'class': 'content'}):
    print(table_data)
And that should find something.
Response to comments from OP:
The precise reason why you're not finding that tag/class pairing in this case is that DataTables renders the table client-side via JavaScript, generally after the DOM has finished loading (although it depends on the page and where the DataTables init code is placed). That means the HTML associated with the base URL does not contain this content. You can see this if you curl the base URL and look at the output.
However when loading it in a browser, once the JavaScript for DataTables fires, the table is rendered and the DOM is dynamically modified to add the table, including cells with the class for which you're looking.
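You can verify this from Python as well (same idea as curl):

# The raw HTML the server returns contains no 'sorting_1' cells; that
# class is only added client-side after the DataTables JavaScript runs.
import requests

html = requests.get("http://www.datatables.net/").text
print("sorting_1" in html)  # False on the initially served page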
