I want to make a Python bot that can interact with Symbolab. Here is an example. I have tried using the requests library and an example of the HCTI library to render the page as an image. Whenever I do this, the page loses its formatting. I am new to web scraping, but I presume this is because the CSS is not being rendered, as I was just grabbing the HTML. Is there a way that I can save an image file of a site like Symbolab so that the page is rendered as it would be in a web browser (all of the equations are readable, etc.)?
You are correct that the CSS is not rendered. When you use the requests library, you get only the raw HTML that the server returns; linked resources such as stylesheets are not downloaded or applied. If you look at Symbolab's page, its CSS is referenced by <link href="/public/auto/main.min.css?110025" rel="Stylesheet" type="text/css"> inside the head of the page's HTML.
If you want to use HCTI (which I assume is https://htmlcsstoimage.com/?), it looks like they accept an html parameter as well as a separate css parameter. So you could just have another request to https://www.symbolab.com/public/auto/main.min.css?110025 to get the CSS and use that with HCTI.
But this is only assuming there is no other CSS reference on their page and that this URL doesn't become invalid. To resolve this, you could scrape the html you received for CSS references and always get the most up to date links.
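Scraping the HTML for CSS references could look roughly like the sketch below. It uses only the standard library; the names `StylesheetLinkParser` and `find_stylesheet_urls` and the sample HTML are made up for illustration, and it only finds `<link rel="stylesheet">` tags, not inline `<style>` blocks or `@import` rules.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class StylesheetLinkParser(HTMLParser):
    """Collects the href of every <link rel="stylesheet"> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attrs = dict(attrs)
        # rel may hold several space-separated tokens and any letter case.
        rel_tokens = (attrs.get("rel") or "").lower().split()
        if "stylesheet" in rel_tokens and attrs.get("href"):
            self.hrefs.append(attrs["href"])


def find_stylesheet_urls(html, base_url):
    """Return absolute URLs of the stylesheets referenced in `html`."""
    parser = StylesheetLinkParser()
    parser.feed(html)
    # Resolve relative hrefs against the URL the page was fetched from.
    return [urljoin(base_url, href) for href in parser.hrefs]


sample_html = (
    '<html><head>'
    '<link href="/public/auto/main.min.css?110025" rel="Stylesheet" type="text/css">'
    '</head><body></body></html>'
)
print(find_stylesheet_urls(sample_html, "https://www.symbolab.com/solver"))
```

Each returned URL could then be downloaded with another request and the combined text passed as the css parameter mentioned above.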
An easier solution might be to use Selenium to programmatically control a browser, which does all the rendering as if you were using a regular browser. You can then take a screenshot of the page with Selenium, or even of a specific element. See this answer
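A minimal sketch of the Selenium approach, assuming Selenium 4 and a local Chrome installation. The function name `save_page_screenshot`, the window size, and the timeout are illustrative choices, not anything the site requires.

```python
def save_page_screenshot(url, out_path, css_selector=None, wait_seconds=10):
    """Render `url` in headless Chrome and save a PNG screenshot.

    If `css_selector` is given, only the first matching element is captured.
    Returns True if the file was written.
    """
    # Imported here so this file can be loaded without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    options.add_argument("--window-size=1280,2000")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        if css_selector is None:
            return driver.save_screenshot(out_path)
        # Wait until the element is visible, since scripts may add it late.
        element = WebDriverWait(driver, wait_seconds).until(
            EC.visibility_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return element.screenshot(out_path)
    finally:
        driver.quit()
```

A call might look like `save_page_screenshot("https://www.symbolab.com/", "page.png")`; pass a CSS selector as the third argument to capture a single element instead of the whole window.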
Hope this helps.
Related
First-time poster here.
I am just getting into Python and coding in general, and I am looking into the requests and BeautifulSoup libraries. I am trying to grab image URLs from Google Images. When inspecting the site in Chrome I can find the "div" and the correct img src URL. But when I open the HTML that requests gives me, I can find the same "div", but the img src URL is something completely different and only leads to a black page if used.
Img of the HTML requests get
Img of the HTML found in chrome's inspect tool
What I wonder, and want to understand, is:
Why are these two versions of the HTML different?
How do I get the img src that is shown in the inspect tool using requests?
Hope the question makes sense and thank you in advance for any help!
The differences between the response HTML and the code in Chrome's inspector probably stem from JavaScript updating the page after it loads. For example, when a script sets innerHTML on a div element, the added markup becomes part of the DOM and therefore shows up in the inspector, but it has no influence on the response that requests receives.
You could search the response for strings that start with http:// and end with .png, .jpg, or another image extension.
Simply put, your code retrieves a single HTML page and lets you access it exactly as it was retrieved. The browser, on the other hand, retrieves that HTML but then lets the scripts embedded in (or linked from) it run, and these scripts often make significant modifications to the document (which the browser represents as the DOM, the Document Object Model). The browser's inspector shows the fully modified DOM.
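To make the difference concrete, here is a small sketch of what a non-browser client sees. The HTML snippet, the placeholder data URI, and the example.com URL are invented for illustration; the parser never runs the script, so it reports only the src that was in the response.

```python
from html.parser import HTMLParser


class ImgSrcParser(HTMLParser):
    """Collects the src attribute of every <img> tag in the raw HTML."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def static_image_sources(html):
    """Return image sources as they appear in the HTML, before any script runs."""
    parser = ImgSrcParser()
    parser.feed(html)
    return parser.sources


# Invented page: the response carries a placeholder image, and a script
# would swap in the real URL once the page runs inside a browser.
raw_html = """
<div id="result">
  <img id="thumb" src="data:image/gif;base64,R0lGODlhAQABAAAAACw=">
</div>
<script>
  document.getElementById("thumb").src = "https://example.com/real-image.jpg";
</script>
"""

print(static_image_sources(raw_html))
```

This prints only the placeholder data URI. The URL assigned inside the script would exist in the DOM only after a browser executed it, which is the version the inspector shows.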
I'm working on a scraper project, and one of the goals is to get every image link from the HTML and CSS of a website. I was using BeautifulSoup and TinyCSS to do that, but now I'd like to switch everything to Selenium so that I can load the JS.
I can't find in the docs a way to target specific CSS properties without having to know the tag/id/class. I can get the images from the HTML easily, but I need to target every "background-image" property in the CSS in order to get the URL from it.
ex: background-image: url("paper.gif");
Is there a way to do it, or should I loop over each element and check the corresponding CSS (which would be time-consuming)?
You can grab all the style tags and parse them, searching for what you need.
You can also download the CSS files, using their resource URLs, and parse them.
Alternatively, you can create an XPath/CSS rule that finds nodes containing the property you're looking for.
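Parsing the downloaded CSS text for background images might be sketched like this. It is a regex-based approximation, not a full CSS parser: the pattern and the function name `background_image_urls` are illustrative, and it captures only the first `url(...)` in each declaration.

```python
import re

# Matches `background` or `background-image` declarations and captures the
# first url(...) value, with or without surrounding quotes.
BACKGROUND_URL_RE = re.compile(
    r"background(?:-image)?\s*:[^;{}]*?url\(\s*(['\"]?)(.*?)\1\s*\)",
    re.IGNORECASE,
)


def background_image_urls(css_text):
    """Return the URLs referenced by background declarations in `css_text`."""
    return [match.group(2) for match in BACKGROUND_URL_RE.finditer(css_text)]


sample_css = """
body { background-image: url("paper.gif"); }
.hero { background: #fff url(img/hero.png) no-repeat; }
.plain { color: red; }
"""

print(background_image_urls(sample_css))
# -> ['paper.gif', 'img/hero.png']
```

The same function works on the text of inline style tags. If you stay inside Selenium, `WebElement.value_of_css_property("background-image")` returns the computed value for one element, but that requires visiting elements one by one.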
The homepage of the website I'm trying to scrape displays four tabs, one of which reads "[Number] Available Jobs". I'm interested in scraping the [Number] value. When I inspect the page in Chrome, I can see the value enclosed within a <span> tag.
However, there is nothing enclosed in that <span> tag when I view the page source directly. I was planning on using the Python requests module to make an HTTP GET request and then use regex to capture the value from the returned content. This is obviously not possible if the content doesn't contain the number I need.
My questions are:
What is happening here? How can a value be dynamically loaded into a page, displayed, and then not appear within the HTML source?
If the value doesn't appear in the page source, what can I do to reach it?
If the content doesn't appear in the page source, then it is probably generated using JavaScript. For example, the site might have a REST API that lists jobs, and the JavaScript code could request the jobs from the API and use the response to create the node in the DOM that shows the number of available jobs. That's just one possibility.
One way to scrape this information is to figure out how that JavaScript works and make your Python scraper do the same thing (for example, if it is using a simple REST API, you just need to make a request to that same URL). Often that is not so easy, so another alternative is to do your scraping with a JavaScript-capable browser driven by a tool like Selenium.
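As a rough sketch of the first approach, assuming the page's script calls a JSON endpoint: the URL `https://example.com/api/jobs` and the `{"jobs": [...]}` payload shape are hypothetical stand-ins. The real endpoint and format have to be read from the browser's Network tab.

```python
import json
from urllib.request import Request, urlopen

# Hypothetical endpoint; replace with the URL seen in the browser's Network tab.
JOBS_API_URL = "https://example.com/api/jobs"


def count_jobs(payload_text):
    """Return the number of jobs in a JSON payload shaped like {"jobs": [...]}."""
    payload = json.loads(payload_text)
    return len(payload.get("jobs", []))


def fetch_job_count(url=JOBS_API_URL, timeout=10):
    """Request the (assumed) JSON API directly and count the jobs it lists."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return count_jobs(response.read().decode("utf-8"))


# Offline example with an invented payload of the assumed shape.
sample_payload = '{"jobs": [{"id": 1}, {"id": 2}, {"id": 3}]}'
print(count_jobs(sample_payload))
# -> 3
```

Keeping the parsing in `count_jobs` separate from the network call makes it easy to adapt once the real response format is known.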
One final thing I want to mention is that regular expressions are a fragile way to parse HTML; you should generally prefer a library like BeautifulSoup.
1. A value can be loaded dynamically with AJAX. AJAX requests run asynchronously, which means the rest of the page does not wait for them to finish; that is why elements loaded with AJAX do not appear in the HTML you download.
2. For scraping dynamic content you should use Selenium; here is a tutorial
For data that loads dynamically, look for an XHR request in the browser's Network tab. If you can use that data directly, voilà!
You can use PhantomJS; it's a headless browser, and it captures the HTML of the page including the dynamically loaded content.
I'm trying to fetch all the visible text from a website, and I'm using python-scrapy for this. However, I observe that Scrapy only works with HTML tags such as div, body, and head, and not with AngularJS tags such as ng-view. If there is any element within the ng-view tags and I right-click on the page and choose "view source", the content inside the tag doesn't appear; it is displayed as <ng-view> </ng-view>. So how can I use Python to scrape the elements within these ng-view tags? Thanks in advance.
To answer your question
how can I use Python to scrape the elements within these ng-view tags
You can't.
The content you want to scrape is rendered on the client side (in the browser). What Scrapy gets you is just the static content from the server; your browser then interprets the HTML and runs the JS code, and the JS code then fetches more content from the server and builds the page from it.
Can it be done?
Yes!
One way is to use some sort of headless browser like http://phantomjs.org/ to fetch all the content. Once you have the content, you can save it and scrape it as you wish. The thing is that this kind of web scraping is not as easy and straightforward as scraping regular HTML. There is a reason why Google still doesn't scrape web pages that render their content via JS.
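A sketch of the headless-browser route is below. PhantomJS is no longer maintained, so this version swaps in headless Chrome driven by Selenium 4, which is an assumption and not what the answer above used. The names `rendered_html` and `visible_text` are illustrative, and "visible" is approximated by skipping script and style contents rather than by evaluating CSS.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes, ignoring the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def visible_text(html):
    """Return the text content of `html`, joined with single spaces."""
    parser = VisibleTextParser()
    parser.feed(html)
    return " ".join(parser.chunks)


def rendered_html(url):
    """Load `url` in headless Chrome and return the HTML after scripts ran."""
    # Imported here so this file can be loaded without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()


# Invented example of what a rendered ng-view might contain.
sample = "<ng-view><h1>Hello</h1><script>var x = 1;</script><p>World</p></ng-view>"
print(visible_text(sample))
# -> Hello World
```

In practice the two pieces would be chained as `visible_text(rendered_html(url))`. Content that loads late may also need an explicit wait before `page_source` is read.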
How would I go about linking CSS and images to a template without routing them through Bottle (@route('/image/') or @route('/css/')) and returning a static_file? I am unable to link the CSS normally (it can't find the CSS/image), and if I do it through static_file, anyone can go to that link and view the CSS/image (i.e. www.mysite.com/css/css.css or www.mysite.com/image/image.png). Is there any way to get around this issue?
In order for a web browser to be able to download and render the CSS or image, it will either have to be part of your page (where people can view it by viewing the source of the page) or be accessible at a URL.
So if you're trying to get around people being able to look at just your css or just your image, the answer is that there's no way around it.
See how to route to static files in Bottle in the documentation, here: http://bottlepy.org/docs/dev/tutorial.html#tutorial-static-files
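For reference, a minimal sketch of the static-file route described in that tutorial. The `/static/` URL prefix, the `./static` directory, and the factory name `make_app` are assumptions made for this example.

```python
def make_app(static_root="./static"):
    """Build a Bottle app that serves CSS and images from `static_root`."""
    # Imported here so this file can be loaded without Bottle installed.
    from bottle import Bottle, static_file

    app = Bottle()

    @app.route("/static/<filepath:path>")
    def serve_static(filepath):
        # static_file restricts access to files below `root`.
        return static_file(filepath, root=static_root)

    return app
```

A template would then reference `/static/css/css.css` or `/static/image/image.png`, and `make_app().run(host="localhost", port=8080)` starts a development server. As noted above, anything served this way remains publicly reachable at its URL.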