Scraping SVG charts - python

I am trying to scrape the SVG charts from the following page:
https://finance.yahoo.com/quote/AAPL/analysts?p=AAPL
The portion I am trying to scrape is as follows:
Images Here
I do not need the text of the charts (just the graphs themselves). However, I have never scraped an SVG image before and I'm not sure if it is possible. I looked around but could not find any useful Python packages to do this directly.
I know that I can take a screenshot of the page with Python using Selenium and then use PIL to crop it and save it as an SVG, but I am wondering if there is a more direct way to grab these charts off the page. Any useful packages or implementations would be helpful. Thank you.
Edit: I got some downvotes but I'm not sure why. Here is how I would implement it my way:
import sys
import time
from PyQt4.QtCore import *
from PyQt4.QtGui import *
from PyQt4.QtWebKit import *

class Screenshot(QWebView):
    def __init__(self):
        self.app = QApplication(sys.argv)
        QWebView.__init__(self)
        self._loaded = False
        self.loadFinished.connect(self._loadFinished)

    def capture(self, url, output_file):
        self.load(QUrl(url))
        self.wait_load()
        # set to webpage size
        frame = self.page().mainFrame()
        self.page().setViewportSize(frame.contentsSize())
        # render image
        image = QImage(self.page().viewportSize(), QImage.Format_ARGB32)
        painter = QPainter(image)
        frame.render(painter)
        painter.end()
        print 'saving', output_file
        image.save(output_file)

    def wait_load(self, delay=0):
        # process app events until page loaded
        while not self._loaded:
            self.app.processEvents()
            time.sleep(delay)
        self._loaded = False

    def _loadFinished(self, result):
        self._loaded = True

s = Screenshot()
s.capture('https://finance.yahoo.com/quote/AAPL/analysts?p=AAPL', 'yhf.png')
I would then use the crop function in PIL to cut the charts out of the screenshot.

Using QWebView for web scraping seems weird to me, although I do realize it has the advantage of telling the server "I'm not a web scraper, I'm an embedded browser". Note that this approach is not bulletproof: your scraper can still be detected if it behaves in a way that is unusual for a human user.
This is how I would do it:
I'd use requests to download the page (maybe through a proxy that hides your real IP address, to combat IP bans).
Then I'd parse the page using BeautifulSoup to get the URL of the SVG file you are trying to get.
Then I'd download the SVG file and convert it into an image using something like this
If you want to continue using Qt instead, look for methods in the web view that allow inspecting the DOM or extracting the resources the view downloaded.
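The "parse the page and find the SVG URL" step could look roughly like the sketch below. It uses the standard library's html.parser as a stand-in for BeautifulSoup so that it runs without extra packages. The sample HTML, the example.com base URL, and the names SvgLinkParser and find_svg_urls are made up for illustration. It also assumes the charts are referenced as separate .svg files; if the real page draws them inline or with JavaScript, the downloaded HTML would not contain such links.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class SvgLinkParser(HTMLParser):
    """Collect URLs of .svg files referenced by <img>, <embed> or <object> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.svg_urls = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("img", "embed", "object"):
            return
        attrs = dict(attrs)
        # <object> points at its resource with "data", the other two with "src"
        ref = attrs.get("data") if tag == "object" else attrs.get("src")
        # Ignore any query string when checking the file extension
        if ref and ref.split("?")[0].lower().endswith(".svg"):
            self.svg_urls.append(urljoin(self.base_url, ref))


def find_svg_urls(html, base_url):
    """Return absolute URLs of all .svg resources found in the HTML text."""
    parser = SvgLinkParser(base_url)
    parser.feed(html)
    return parser.svg_urls


# Hypothetical page content; in practice this would be the downloaded HTML
sample = (
    '<html><body><img src="/charts/a.svg"><img src="logo.png">'
    '<object data="b.svg?v=2"></object></body></html>'
)
svg_urls = find_svg_urls(sample, "https://example.com/quote/")
print(svg_urls)
```

Each URL in the resulting list could then be downloaded in the same way as the page itself.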

Related

How to get URL of media clicked on in a QWebEnginePage, and all other media links on the page

I can't compile my own Qt with proprietary codecs, let alone compile PyQt5 against that Qt build, so I have a workaround idea for displaying these kinds of media files (e.g. H.264) in my browser (I assume I will still need to get licences etc.). My idea is to collect all of the media links on the webpage, put them into a QMediaPlaylist, and then select the correct video when one is clicked on the webpage.
from PyQt5.QtCore import *
from PyQt5.QtWidgets import *
from PyQt5.QtGui import *
from PyQt5.QtMultimedia import *
from PyQt5.QtMultimediaWidgets import *
import sys

class Player(QMainWindow):
    def __init__(self,parent=None,*args,**kwargs):
        QMainWindow.__init__(self,parent,*args,**kwargs)
        self._player = QMediaPlayer(self)
        self._player.setAudioRole(QAudio.VideoRole)
        self._playlist = QMediaPlaylist()
        self._playlist.addMedia(QMediaContent(QUrl("http://techslides.com/demos/sample-videos/small.mp4")))
        self._player.setPlaylist(self._playlist)
        self._videoWidget = VideoWidget(self)
        self._player.setVideoOutput(self._videoWidget)
        self._playlist.setCurrentIndex(0)
        self._player.play()
        self.setCentralWidget(self._videoWidget)
        self.showMaximized()

class VideoWidget(QVideoWidget):
    def __init__(self,parent,*args,**kwargs):
        QVideoWidget.__init__(self,parent,*args,**kwargs)

    def keyPressEvent(self,event):
        # Escape leaves full screen; Alt+Enter toggles it
        if event.key() == Qt.Key_Escape and self.isFullScreen():
            self.setFullScreen(False)
            event.accept()
        # Alt is a keyboard modifier (Qt.AltModifier), not a key code
        elif event.key() in (Qt.Key_Return, Qt.Key_Enter) and event.modifiers() & Qt.AltModifier:
            self.setFullScreen(not self.isFullScreen())
            event.accept()
        else:
            super().keyPressEvent(event)

    def mouseDoubleClickEvent(self,event):
        self.setFullScreen(not self.isFullScreen())
        event.accept()

# sys.excepthook is called with (type, value, traceback), so no self parameter
def exceptHook(e,v,t):
    sys.__excepthook__(e,v,t)

sys.excepthook = exceptHook
app = QApplication(sys.argv)
mainWindow = Player()
app.exec_()
In this example, there is only one .mp4 link in the media playlist for testing, whereas for my browser I would like to be able to add the URLs of all the media files on the webpage. When a video is clicked on the webpage, I would like to locate its URL in the playlist, switch to the corresponding index, overlay the video widget onto the webpage, and play it. I don't know how to:
Get all the media links of the webpage (not all are intercepted in QWebEngineUrlRequestInterceptor, as some links are stored in tags in the main frame's HTML code)
Get the clicked media's URL (need some sort of signal or event maybe?)
Get a video's coordinates on the page (and update them for scrolling etc.)
I have tried using triggerAction and intercepting the events, but none of the actions let me get the clicked video's URL. I suppose I could get the webpage HTML with .toHtml(callback), and maybe get the links from there by filtering with BeautifulSoup4, but the .toHtml call is quite slow in my experience. I assume there is some JavaScript I could run to get the video's coordinates (and maybe links), but I have no experience with JavaScript.

How to get the html dom of a webpage and its frames

I would like to get the DOM of a website after js execution.
I would also like to get all the content of the iframes in the website, similarly to what I have in Google Chrome's Inspect Element feature.
This is my code:
import sys
from PyQt4 import QtGui, QtCore, QtWebKit

class Sp():
    def save(self):
        print ("call")
        data = self.webView.page().currentFrame().documentElement().toInnerXml()
        print(data.encode('utf-8'))
        print ('finished')

    def main(self):
        self.webView = QtWebKit.QWebView()
        self.webView.load(QtCore.QUrl("http://www.w3schools.com/tags/tryit.asp?filename=tryhtml_iframe_scrolling"))
        QtCore.QObject.connect(self.webView,QtCore.SIGNAL("loadFinished(bool)"),self.save)

app = QtGui.QApplication(sys.argv)
s = Sp()
s.main()
sys.exit(app.exec_())
This gives me the HTML of the website, but not the HTML inside the iframes. Is there any way that I could get the HTML of the iframes?
This is a very hard problem to solve in general.
The main difficulty is that there is no way to know in advance how many frames each page has. And in addition to that, each child-frame may have its own set of frames, the number of which is also unknown. In theory, there could be an infinite number of nested frames, and the page will never finish loading (which seems no exaggeration for sites that have a lot of ads).
Anyway, below is a version of your script which gets the top-level QWebFrame object of each frame as it loads, and shows how you can access some of the things you are interested in. As you will see from the output, there are a lot of "junk" frames inserted by ads and such like that you will somehow need to filter out.
import sys, signal
from PyQt4 import QtGui, QtCore, QtWebKit

class Sp():
    def save(self, ok, frame=None):
        if frame is None:
            print ('main-frame')
            frame = self.webView.page().mainFrame()
        else:
            print('child-frame')
        print('URL: %s' % frame.baseUrl().toString())
        print('METADATA: %s' % frame.metaData())
        print('TAG: %s' % frame.documentElement().tagName())
        print()

    def handleFrameCreated(self, frame):
        frame.loadFinished.connect(lambda: self.save(True, frame=frame))

    def main(self):
        self.webView = QtWebKit.QWebView()
        self.webView.page().frameCreated.connect(self.handleFrameCreated)
        self.webView.page().mainFrame().loadFinished.connect(self.save)
        self.webView.load(QtCore.QUrl("http://www.w3schools.com/tags/tryit.asp?filename=tryhtml_iframe_scrolling"))

signal.signal(signal.SIGINT, signal.SIG_DFL)
print('Press Ctrl+C to quit\n')
app = QtGui.QApplication(sys.argv)
s = Sp()
s.main()
sys.exit(app.exec_())
NB: it is important that you connect to the loadFinished signal of the main frame rather than the web-view. If you connect to the latter, it will be called multiple times if the page contains more than one frame.
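Because each child frame may have child frames of its own, collecting them all is naturally a recursive walk. The sketch below shows the idea on a tiny stand-in class, since it is meant to run without Qt: FakeFrame and walk_frames are hypothetical names, and the only thing borrowed from QWebFrame is the childFrames() method. It also assumes the frames have already finished loading, which the real page does not guarantee.

```python
def walk_frames(frame, depth=0):
    """Yield (depth, frame) for a frame and all its descendants, depth-first."""
    yield depth, frame
    for child in frame.childFrames():
        yield from walk_frames(child, depth + 1)


class FakeFrame:
    """Hypothetical stand-in for QWebFrame: only a name and childFrames()."""

    def __init__(self, name, children=()):
        self.name = name
        self._children = list(children)

    def childFrames(self):
        return self._children


# A made-up page: the main frame holds an ad frame (with a nested frame)
# and a content frame
page = FakeFrame("main", [
    FakeFrame("ad", [FakeFrame("tracker")]),
    FakeFrame("content"),
])
names = [(depth, frame.name) for depth, frame in walk_frames(page)]
print(names)
```

With a real page, the same function could be called on webView.page().mainFrame() once loading has finished, and the depth value could help with filtering out deeply nested ad frames.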

QWebView get response

I have some Python code using PySide with a QWebView that shows Google Maps.
I just want to get the response each time I make a request using the QWebView widget.
I have searched for information, but found no reference about getting a response with PySide. If you need me to paste some code I will, but I just have a simple QWebView widget.
EDIT: You asked me for the code:
from PySide.QtCore import *
from PySide.QtGui import *
import sys
import pyside3

class MainDialog(QMainWindow, pyside3.Ui_MainWindow):
    def __init__(self, parent=None):
        super(MainDialog,self).__init__(parent)
        self.setupUi(self)
        token_fb=""
        #self.Connect_buttom.clicked.connect(self.get_fb_token)
        self.Connect_buttom.clicked.connect(lambda: self.get_fb_token(self.FB_username.text(), self.FB_password.text()))
        #self.connect(self.Connect_buttom, SIGNAL("clicked()"), self.get_fb_token)
        #Change between locate and hunt
        self.MapsButton.clicked.connect(lambda: self.select_page_index(0))
        self.HuntButton.clicked.connect(lambda: self.select_page_index(1))
        ###########################
        self.webView.setHtml(URL)

    def select_page_index(self, index): # To change between frames
        self.Container.setCurrentIndex(index)
I need the response from: self.webView.setHtml(URL) because depending on the response my app has to do one thing or other.
The function QWebView.setHtml() has no response, in the sense that it doesn't return anything.
Maybe you want to listen for all links that are clicked and do something custom with them.
web_view = QtWebKit.QWebView()
web_view.page().setLinkDelegationPolicy(QtWebKit.QWebPage.DelegateAllLinks)
web_view.linkClicked.connect(your_handler)
Or maybe you want to do something when loading has finished. This is done by:
web_view = QtWebKit.QWebView()
web_view.loadFinished.connect(your_handler)

PyQt wait until page has loaded

I want to save a page's content to an image once it is fully loaded, but sometimes the output image is not rendered completely.
Code:
import sys
import signal
import os
from PyQt4.QtCore import *
from PyQt4.QtGui import *
from PyQt4.QtWebKit import QWebPage

app = QApplication(sys.argv)
signal.signal(signal.SIGINT, signal.SIG_DFL)
webpage = QWebPage()

def onLoadFinished(result):
    if not result:
        print "Request failed"
        sys.exit(1)
    webpage.setViewportSize(webpage.mainFrame().contentsSize())
    image = QImage(webpage.viewportSize(), QImage.Format_ARGB32)
    painter = QPainter(image)
    webpage.mainFrame().render(painter)
    painter.end()
    if os.path.exists("output.png"):
        os.remove("output.png")
    image.save("output.png")
    sys.exit(0) # quit this application

webpage.mainFrame().load(QUrl("file:///page.html"))
webpage.connect(webpage, SIGNAL("loadFinished(bool)"), onLoadFinished)
sys.exit(app.exec_())
The page uses JavaScript (an onload function) to fetch a Google map (640x640px).
Image: http://i56.tinypic.com/15ojg3s.png
I'm not sure if this is even possible. For a static website this could probably work, but Google Maps loads tiles dynamically, and I doubt it will emit a usable "I'm done" signal.
But it seems you only want an image of a Google map? Have you looked at their API? It lets you generate static maps just by building a URL.
Example
http://maps.google.com/maps/api/staticmap?center=Brooklyn+Bridge,New+York,NY&zoom=14&size=512x512&maptype=roadmap&markers=color:blue|label:S|40.702147,-74.015794&markers=color:green|label:G|40.711614,-74.012318&markers=color:red|color:red|label:C|40.718217,-73.998284&sensor=false
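Building such a URL by hand is error-prone, so here is a minimal sketch that assembles it with urllib.parse. The endpoint and the parameter names (center, zoom, size, maptype, markers, sensor) are taken from the example URL above. The function name static_map_url and its signature are made up for illustration, and newer versions of the service may also require an API key, which is not covered here.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint copied from the example URL above
BASE = "http://maps.google.com/maps/api/staticmap"


def static_map_url(center, zoom, size, markers=(), maptype="roadmap"):
    """Build a static map URL; size is a (width, height) tuple in pixels."""
    params = [
        ("center", center),
        ("zoom", zoom),
        ("size", "%dx%d" % size),
        ("maptype", maptype),
    ]
    # The markers parameter may be repeated, once per marker
    params += [("markers", m) for m in markers]
    params.append(("sensor", "false"))
    # Keep ":", "," and "|" readable instead of percent-encoding them
    return BASE + "?" + urlencode(params, safe=":,|")


url = static_map_url(
    "Brooklyn Bridge,New York,NY",
    14,
    (512, 512),
    markers=[
        "color:blue|label:S|40.702147,-74.015794",
        "color:green|label:G|40.711614,-74.012318",
    ],
)
print(url)
```

The resulting URL can be downloaded like any other image, which avoids having to wait for a dynamically loading map to finish rendering.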

PyQt QWebKit frame bug?

I'm using Python, PyQt4, and QtWebKit to load a web page into a bare-bones browser to examine the data.
However, there is a small issue. I'm trying to get the contents and src of every iframe on the loaded page. I'm using webView.page().mainFrame().childFrames() to get the frames. The problem is that childFrames() returns the frames ONLY if they're visible in the browser. For example, when your browser is positioned at the top of the page, childFrames() will not return the iframes that are in the footer of the page. Is there a setting I could tweak so that I get all the ads? I've attached the source of my "browser". Try scrolling down when the page finishes loading. Watch the console and you will see that the iframes load dynamically. Please help.
from PyQt4 import QtGui, QtCore, QtWebKit
import sys
import unicodedata

class Sp():
    def Main(self):
        self.webView = QtWebKit.QWebView()
        self.webView.load(QtCore.QUrl("http://www.msnbc.msn.com/id/41197838/ns/us_news-environment/"))
        self.webView.show()
        QtCore.QObject.connect(self.webView,QtCore.SIGNAL("loadFinished(bool)"),self.Load)

    def Load(self):
        frame = self.webView.page().mainFrame()
        children = frame.childFrames()
        fT = []
        for x in children:
            print "=========================================="
            print unicodedata.normalize('NFKD', unicode(x.url().toString())).encode('ascii','ignore')
            print "=========================================="
            fT.append([unicode(x.url().toString()),unicode(x.toHtml()),[]])
        for x in range(len(fT)):
            f = children[x]
            tl = []
            for fx in f.childFrames():
                print "___________________________________________"
                print unicodedata.normalize('NFKD', unicode(fx.url().toString())).encode('ascii','ignore')
                print "___________________________________________"
                tl.append([unicode(fx.url().toString()),unicode(fx.toHtml()),[]])
            fT[x][2] = tl

app = QtGui.QApplication(sys.argv)
s = Sp()
s.Main()
app.exec_()
Not sure why you're doing what you're doing, but if it's only loading what's visible, you can set the page viewport size to the content size and that should load everything:
def Load(self):
    self.webView.page().setViewportSize(
        self.webView.page().mainFrame().contentsSize())
However, this has a weird effect in the GUI so this solution may be unacceptable for what you are trying to do.
