Web Scraping Javascript Using Python

I am used to using BeautifulSoup to scrape a website, however this website is different: when I call soup.prettify() I get back JavaScript code, lots of it. I want to scrape this website for the data on the actual page (company name, telephone number, etc.). Is there a way of scraping these scripts, such as Main.js, to retrieve the data that is displayed on the website to me?
Clearer version:
The code is:
<script src="/docs/Main.js" type="text/javascript" language="javascript"></script>
This script holds the text that appears on the website. I would like to scrape this text, however it is populated using JS, not HTML (which is what I used BeautifulSoup for).

You're asking if you can scrape text generated at runtime by Javascript. The answer is sort-of.
You'd need to run some kind of headless browser, like PhantomJS, in order to let the Javascript execute and populate the page. You'd then need to feed the HTML that the headless browser generates to BeautifulSoup in order to parse it.
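For instance, here's a minimal sketch using PhantomJS through Selenium's Python bindings and handing the rendered HTML to BeautifulSoup (this assumes the phantomjs binary is on your PATH; the URL is a placeholder):
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.PhantomJS()  # requires the phantomjs binary on PATH
driver.get("http://example.com/page")  # placeholder URL
rendered_html = driver.page_source  # the HTML after the JavaScript has run
driver.quit()

soup = BeautifulSoup(rendered_html, "html.parser")
print(soup.prettify())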

Related

HTML Parsing with Python (HTML vs. complete website)

I'm trying to parse HTML from a website that contains information about train tickets and their prices (source below), however I'm having an issue getting back all the HTML from the website when I use urllib to request it.
What I need is the price per ticket, which doesn't seem to appear when I use urllib to request the HTML. After doing some investigative work, I determined that if I save the webpage with Chrome and select "HTML only", I don't get the price, however if I select "Complete Webpage," I do. Is there any way to view the HTML that I get when I download the "Complete Webpage" and use that in Python? Or is there a way to automate the downloading of the complete webpage and use the downloaded files to parse in Python?
Thanks,
George
https://www.raileurope.com/en/us/point_to_point/ptp_results.htm?execution=e3s1&resultId=147840746&cobrand=public&saleCountry=us&resultId=147840746&cobrand=public&saleCountry=us&itemId=-1&fn=fsRequest&cobrand=public&c=USD&roundtrip=0&isAtocRequest=0&georequest=1&lang=en&route-type=0&from0=paris&to0=amsterdam&deptDate0=06%2F07%2F2017&time0=8&pass-question-radio=1&nCountries=&selCountry1=&selCountry2=&selCountry3=&selCountry4=&selCountry5=&familyId=&p=0&additionalTraveler0=adult&additionalTravelerAge0=&paxIds=&nA=1&nY=0&nC=0&nS=0
Take a look at selenium
Since the website is rendered by JS, you will have to use a webdriver to drive a real browser and let the page render.
You will need a crawler instead of a simple scraper
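A rough sketch of that approach (the ".price" CSS selector is a guess for the price elements; inspect the actual results page to find the real one):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

results_url = "https://www.raileurope.com/..."  # the full ptp_results.htm URL from the question
driver = webdriver.Firefox()
driver.get(results_url)
# wait until the dynamically loaded price elements appear; ".price" is a hypothetical selector
WebDriverWait(driver, 30).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".price"))
)
full_html = driver.page_source  # equivalent to saving "Complete Webpage" in Chrome
driver.quit()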

Scrape data from JavaScript-rendered website

I want to scrape the Lulu webstore. I have the following problems with it.
The website content is loaded dynamically.
When accessed, the website redirects to a choose-country page.
After choosing a country, it pops up a select-delivery-location dialog and then redirects to the home page.
When you try to hit end page programmatically, you get an empty response because the content is loaded dynamically.
I have a list of end URLs from which I have to scrape data. For example, consider mobile accessories. Now I want to
Get the HTML source of that dynamically loaded page directly, bypassing the choose-country and select-location popups, so that I can use my Scrapy XPath selectors to extract data.
If you suggest me to use Selenium, PhantomJS, Ghost or something else to deal with dynamic content, please understand that I want the end HTML source as in a web browser after processing all dynamic content which will be sent to Scrapy.
Also, I tried using proxies to skip the choose-country popup, but it still loads that page and the select-delivery-location popup.
I've tried using Splash, but it returns the source of the choose-country page.
At last I found the answer. I used the EditThisCookie plugin to view the cookies that are loaded by the web page. I found that it stores 3 cookies, CurrencyCode, ServerId, and Site_Config, in my local storage. I used the above-mentioned plugin to copy the cookies in JSON format, and referred to the requests documentation for setting cookies.
Now I'm able to skip those location and delivery-address popups. After that I found that the dynamic pages are loaded via <script type=text/javascript> and that part of the page URL is stored in a variable. I extracted the value using split(). Here is the script part to get the dynamic page URL.
import requests
from lxml import html

# cookie values copied from the browser via EditThisCookie (placeholders here)
jar = {"CurrencyCode": "...", "ServerId": "...", "Site_Config": "..."}
url = "http://www.luluwebstore.com/"  # placeholder: the end URL to scrape
page_source = requests.get(url, cookies=jar)
tree = html.fromstring(page_source.content)
dynamic_pg_link = tree.xpath('//div[@class="col3_T02"]/div/script/text()')[0]  # entire JavaScript that loads the product pages
dynamic_pg_link = dynamic_pg_link.split("=")[1].split(";")[0].strip()  # obtains the dynamic page URL
page_link = "http://www.luluwebstore.com/Handler/ProductShowcaseHandler.ashx?ProductShowcaseInput=" + dynamic_pg_link
Now I'm able to extract data from these links.
Thanks to @Cal Eliacheff for the previous guidance.

fetch text from web with Angular JS tags such as ng-view

I'm trying to fetch all the visible text from a website using python-scrapy. However, what I observe is that Scrapy only works with standard HTML tags such as div, body, and head, and not with Angular JS tags such as ng-view. If there is any element within an ng-view tag, then when I right-click the page and view the source, the content inside the tag doesn't appear; it displays as <ng-view> </ng-view>. So how can I use Python to scrape the elements within these ng-view tags? Thanks in advance.
To answer your question
how can I use Python to scrape the elements within these ng-view tags
You can't.
The content you want to scrape is rendered on the client side (in the browser). What Scrapy gets you is just the static content from the server; your browser then interprets the HTML and runs the JS code, and that JS code in turn fetches different content from the server and inserts it into the page.
Can it be done?
Yes!
One of the ways is to use some sort of headless browser, like http://phantomjs.org/, to fetch all the content. Once you have the content you can save it and scrape it as you wish. The thing is that this kind of web scraping is not as easy and straightforward as just scraping regular HTML. There is a reason why Google still doesn't scrape web pages that render their content via JS.
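A minimal sketch along those lines, assuming PhantomJS is installed and feeding the rendered HTML to Scrapy's Selector (the URL is a placeholder):
from selenium import webdriver
from scrapy.selector import Selector

driver = webdriver.PhantomJS()  # requires the phantomjs binary on PATH
driver.get("http://example.com/angular-app")  # placeholder URL
rendered = driver.page_source  # HTML after Angular has filled in ng-view
driver.quit()

sel = Selector(text=rendered)
print(sel.xpath('//ng-view//text()').extract())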

omegle lxml scrape not working

So I'm performing a scrape of Omegle, trying to scrape the number of users online.
This is the HTML code:
<div id="onlinecount">
<strong>
30,000+
</strong>
</div>
Now I would presume that using lxml the XPath would be //div[@id="onlinecount"] to scrape any text within that div. I want to get the number from the <strong> tag, but when I try to scrape this, I just end up with an empty list.
Here's my relevant code:
print "\n Grabbing users online now from",self.website
site = requests.get(self.website)
tree = html.fromstring(site.text)
users = tree.xpath('//div[@id="onlinecount"]')
Note that the self.website variable is just http://www.omegle.com
Any ideas what I'm doing wrong? Note I can scrape other parts just not the number of online users.
I ended up using a different set of code which I learned from a friend.
Here's my full code for anyone interested.
http://pastebin.com/u1kTLZtJ
When you send a GET request to http://www.omegle.com using the requests Python module, what I observed is that there is no "onlinecount" in site.text. The reason is that that part gets rendered by JavaScript. You should use a library that is able to execute the JavaScript and give you the final HTML source as rendered in a browser. One such third-party library is Selenium (http://selenium-python.readthedocs.org/). The only downside is that it opens a real web browser.
Below is working code using Selenium:
from selenium import webdriver

browser = webdriver.Firefox()
browser.get("http://www.omegle.com")
# by the time the page has loaded, the JS has run and the count is present
element = browser.find_element_by_id("onlinecount")
onlinecount = element.find_element_by_tag_name("strong")
print onlinecount.text  # e.g. "30,000+"
You can also send a GET request to http://front1.omegle.com/status,
which will return the count of online users and other details in JSON form.
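For example (the "count" field name is based on the observed JSON response and may change):
import requests

status = requests.get("http://front1.omegle.com/status").json()
print(status["count"])  # number of users online, per the observed JSON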
I have done a bit of looking at this and that particular part of the page is not XML but Javascript.
Here is the source (this is what the requests library is returning in your program)
<div id="onlinecount"></div>
<script>
if (IS_MOBILE) {
$('sharebuttons').dispose();
$('onlinecount').dispose();
}
</script>
</div>
As you can see, in lxml's eyes there is nothing but a script in the onlinecount div.
I agree with Praveen.
If you want to avoid launching a visible browser, you could use PhantomJS, which also has a Selenium driver:
http://phantomjs.org/
PhantomJS is a headless WebKit scriptable with a JavaScript API
Instead of selenium scripts, you could also write PhantomJS js scripts (but I assume you prefer to stay in Python env ;))

Using BeautifulSoup to call a JAVA function

I am trying to scrape some data from the following website
http://www.pro-football-reference.com/teams/crd/2000_roster.htm
In particular, I want to scrape the data in the roster table. There is a red link at the heading of the table named "CSV" and if you click on it, the page loads the table information in csv format. The HTML code of this link is
<span tip="Get a widget to embed this table on your site" class="tooltip" onclick="sr_display_embed(this,'games_played_team'); try { pageTracker._trackEvent('Tool','Action','Embed'); } catch (err) {}">Embed</span>
I assume the function table2csv() is what is being executed. I don't have any experience with web development, so I'm not even sure what this function is; I'm assuming it's JAVA. I'm looking for some guidance on how I can use BeautifulSoup to automate executing this function and then scraping the text in the HTML parse tree that appears after the function executes. Thank you.
The code that the page executes is JavaScript, more specifically AJAX. I recommend you use Selenium for this, mainly because it brings up a browser, so you can make a program click this link, let the AJAX call load, and then scrape the content. This is the more accurate solution. Selenium is available for a lot of languages like Java, C#, Python, etc.
If you don't want to use Selenium, you can instead look at the XHR requests the browser makes and obtain the CSV directly, I think. You can see these using Chrome by pressing F12 to open the developer tools, or by installing Firebug for Firefox, in the Network tab.
I am not familiar with BeautifulSoup and know very little Python, but I have dabbled in trying to scrape Pro-Football-Reference in Java with JSoup and, later, HtmlUnit...
JSoup, and likely BeautifulSoup (as they are similar according to my recent Google search), are not designed to invoke JavaScript functions.
Additionally, the page does not invoke a network request when the CSV link is invoked. Therefore, there is no known url that can be invoked to obtain the data in CSV format. The table2csv function in javascript creates the csv data from the html table data.
Your best option is to do as the JavaScript table2csv function does: take the table data, obtainable via BeautifulSoup, and parse it directly.
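A rough sketch of that approach (the table id "games_played_team" comes from the onclick handler quoted above and is an assumption; verify it against the actual page):
import csv
import requests
from bs4 import BeautifulSoup

url = "http://www.pro-football-reference.com/teams/crd/2000_roster.htm"
soup = BeautifulSoup(requests.get(url).text, "html.parser")
table = soup.find("table", id="games_played_team")  # id taken from the onclick above

with open("roster.csv", "w") as f:
    writer = csv.writer(f)
    for row in table.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
        writer.writerow(cells)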
