Webscraping: book page images inside reader? - python

The image below shows source code I came across while web scraping. The link to the website is URL. It is basically an online book reader that shows rendered images of book pages. What is odd is that I cannot find any "img" tags, "src" attributes, or any URL pointing to these images. Any hints on how to scrape this?

Related

Download video with blob url

I am working on a Python project to scrape videos from Instagram using Selenium and requests.
I am following the tutorial at the link below, but it seems Instagram has changed its settings:
https://www.youtube.com/watch?v=3DCtaJvf6VA&list=PLEsfXFp6DpzQjDBvhNy5YbaBx9j-ZsUe6&index=17&ab_channel=CodingEntrepreneurs
However, when I get the link to an Instagram video, it looks like this:
blob:https://www.instagram.com/4ddaf674-312a-4366-ad12-136bda7b6c8e
Hence, I cannot download the video. I have looked at similar posts, where people find m3u8 or .ts files in the browser inspector, but I cannot find either of them. Could anyone help?

Requests.get not saving images on the webpage

When I right-click a page, choose "Save as", and save it as HTML, it saves everything on the page, including images. However, when I use Python's requests.get, it saves the HTML page without the images. They appear as broken links, like so:
Broken images
Instead of like so:
Working images
How can I get requests.get to get all the data on the webpage? Appreciate any advice.
Edit: This is the code I'm using to scrape the website:
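import requests

# aosa1: iterable of page names, defined elsewhere in the asker's script (not shown)
for link in aosa1:
    res = requests.get("http://aosabook.org/en/" + link)
    print(res.status_code)
    print(len(res.text))
    res.raise_for_status()
    # The with block closes the file; the original "playfile.close" lacked parentheses and never ran
    with open(link, 'wb') as playfile:
        for chunk in res.iter_content(100000):
            playfile.write(chunk)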
This comes down to how HTML works: the images are not part of the page itself. When a browser downloads an HTML file, it scans the HTML for <img> tags, and for each one it makes a new HTTP request to fetch the image so it can display it. If the <img> tags had absolute URLs, the images would still show up in your saved copy. But if they have relative URLs (<img src="/images/abc.png">), the browser resolves them against the location of your saved file, where those images do not exist. To get them, scan the HTML yourself and fetch each image separately.
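The scanning step described above could be sketched roughly as follows, using only the standard library. The names ImgSrcCollector and image_urls are made up for this illustration, and the sample HTML string is invented; only the aosabook.org base URL comes from the question.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it sees (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def image_urls(html, page_url):
    """Return absolute URLs for all <img> tags in html, resolved against page_url."""
    collector = ImgSrcCollector()
    collector.feed(html)
    # urljoin leaves absolute URLs alone and resolves relative ones like a browser would
    return [urljoin(page_url, src) for src in collector.sources]


# Invented sample markup, just to show the resolution of relative and absolute src values
sample = '<p><img src="/images/abc.png"><img src="http://example.com/x.jpg"></p>'
print(image_urls(sample, "http://aosabook.org/en/index.html"))
```

Each returned URL can then be fetched with requests.get and written to disk the same way the pages are. The saved HTML would still need its src attributes rewritten to point at the local copies before the images display offline.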

Python: Scrape videos (that are invisible in HTML) from webpages

I would like to download the videos embedded in a set of URLs from the Google Ad Transparency report. Here's a sample page where the video is a YouTube link:
https://transparencyreport.google.com/political-ads/advertiser/AR113835462480625664/creative/CR525481620803682304
And here's a sample page where the video is hosted by Google and the video file can be downloaded directly:
https://transparencyreport.google.com/political-ads/advertiser/AR528016269983612928/creative/CR221377870159675392
In both cases, regular browsers (Chrome, Firefox) let me copy the YouTube link URL (top example) or download the linked video file (bottom example).
However, I cannot locate these links in the page source. Can anyone tell me how to locate them, or how one would write a script that would locate the correct tags and grab the video files (or YouTube links)? Is this a dynamic content problem?
You can get it through the Transparency Report API.
For example, the page you linked gets its video content from the following endpoint:
https://transparencyreport.google.com/transparencyreport/api/v3/politicalads/creatives/details?entity_id=AR528016269983612928&creative_id=CR221377870159675392
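Going from a creative page URL to that endpoint can be sketched as below. The function name details_url is hypothetical, and the mapping is inferred only from the one pair of URLs in this thread. Whether the endpoint still responds, and what its response looks like, is not covered here.

```python
from urllib.parse import urlencode, urlparse

# Endpoint taken from the answer above; treat it as unverified
API_DETAILS = (
    "https://transparencyreport.google.com/transparencyreport/api/v3/"
    "politicalads/creatives/details"
)


def details_url(page_url):
    """Build the details endpoint URL for a political-ads creative page URL."""
    parts = urlparse(page_url).path.strip("/").split("/")
    # Expected path shape: political-ads/advertiser/<AR...>/creative/<CR...>
    entity_id = parts[parts.index("advertiser") + 1]
    creative_id = parts[parts.index("creative") + 1]
    query = urlencode({"entity_id": entity_id, "creative_id": creative_id})
    return API_DETAILS + "?" + query


page = (
    "https://transparencyreport.google.com/political-ads/advertiser/"
    "AR528016269983612928/creative/CR221377870159675392"
)
print(details_url(page))
```

The resulting URL can be requested with requests.get like any other page, and the response inspected by hand to find where the video link sits.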

How to extract a YouTube thumbnail from a YouTube link in Python

I want to extract YouTube search results for a query and display them to the user.
So far, I can fetch the YouTube link and extract the title from it.
I also want to display the thumbnail for that link, the same one shown in YouTube's suggestions section.
For a question like this, I'd recommend running a site:youtube.com search on Google Images and looking at the URLs of one or two thumbnails. I believe the pattern below should work in all cases, though you'd need to test it on different types of videos.
If the video URL is https://www.youtube.com/watch?v=xxxxxxxxxxxx
The thumbnail URL is https://i.ytimg.com/vi/xxxxxxxxxxxx/maxresdefault.jpg
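A minimal sketch of that pattern in Python, assuming the link is in the watch?v= form shown above. The function name thumbnail_url and its name parameter are made up for this example, and other link shapes (short links, embeds) are not handled.

```python
from urllib.parse import parse_qs, urlparse


def thumbnail_url(video_url, name="maxresdefault.jpg"):
    """Build a thumbnail URL from a https://www.youtube.com/watch?v=... link."""
    query = parse_qs(urlparse(video_url).query)
    if "v" not in query:
        raise ValueError("no 'v' parameter found in " + video_url)
    video_id = query["v"][0]
    # Pattern taken from the answer above; not checked against every video type
    return f"https://i.ytimg.com/vi/{video_id}/{name}"


print(thumbnail_url("https://www.youtube.com/watch?v=xxxxxxxxxxxx"))
```

Since the answer itself says the pattern needs testing, it is worth checking the HTTP status of the resulting URL before displaying it.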

Scrape PDFs inside viewer frame

(Complete beginner in web scraping here)
I'm trying to scrape the PDF from this webpage using Python:
http://pesquisa.in.gov.br/imprensa/jsp/visualiza/index.jsp?jornal=3&pagina=1&data=31/03/1993
The problem is that the above URL points to the viewer (with date and page parameters), not to the PDF file. I inspected the HTML looking for a direct URL to the PDF, but could not find one.
Any help on how to find the correct URL and download the files in Python?
Edit:
I will later generalize this to other days and pages. The full list of day-page links can be found by searching for the relevant period here: http://portal.imprensanacional.gov.br/
