The image below shows source code I came across while web scraping. The link to the website is URL. It is basically an online book reader that shows rendered images of book pages. What's weird is that I am unable to find any "img" tags, "src" attributes, or any URL pointing to these images. Any hints on how to scrape this?
Related
I am working on a Python project to scrape videos from Instagram using Selenium and requests.
I am following the tutorial at the link below, but it seems Instagram has changed its setup since then:
https://www.youtube.com/watch?v=3DCtaJvf6VA&list=PLEsfXFp6DpzQjDBvhNy5YbaBx9j-ZsUe6&index=17&ab_channel=CodingEntrepreneurs
However, the link I get for an Instagram video looks like this:
blob:https://www.instagram.com/4ddaf674-312a-4366-ad12-136bda7b6c8e
Hence, I cannot download the video. I have looked at similar posts, where people find .m3u8 or .ts requests in the browser inspector, but I cannot find either of them. Could anyone help?
When I right-click a page, choose "Save as", and save it as HTML, it saves everything on the page, including images. However, when I use Python's requests.get, it saves the HTML page without the images. They appear as broken links, like so:
Broken images
Instead of like so:
Working images
How can I get requests.get to get all the data on the webpage? Appreciate any advice.
Edit: This is the code I'm using to scrape the website:
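    import requests

    # aosa1 is the list of relative page paths (defined elsewhere, not shown)
    for link in aosa1:
        res = requests.get("http://aosabook.org/en/" + link)
        print(res.status_code)
        print(len(res.text))
        res.raise_for_status()
        # the with block closes the file once all chunks are written
        with open(link, 'wb') as playfile:
            for chunk in res.iter_content(100000):
                playfile.write(chunk)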
You don't understand how HTML works. Those images are not part of the page. When a browser downloads an HTML file, it scans the HTML for <img> tags and, for each one, makes a new HTTP request to fetch the image so it can display it. If the <img> tags had absolute URLs, the images would still show up for you. But if they have relative URLs (<img src="/images/abc.png">), the browser resolves them against the location of your saved copy (your local disk or a local server), where those files do not exist. You can scan the HTML yourself and fetch those images.
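As a rough sketch of "scan the HTML and fetch those images", something like the following could work. It uses only the standard library (html.parser, urljoin, urlretrieve) instead of requests, and the names ImgSrcCollector, extract_image_urls, and download_images are made up for this example. It assumes the images are referenced through plain <img src="..."> attributes and not loaded by JavaScript or CSS.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlretrieve


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:  # skip <img> tags without a usable src
                self.sources.append(src)


def extract_image_urls(html, page_url):
    """Return absolute URLs for all <img> tags found in html.

    Relative src values are resolved against page_url, the same way
    a browser resolves them against the page it loaded.
    """
    collector = ImgSrcCollector()
    collector.feed(html)
    return [urljoin(page_url, src) for src in collector.sources]


def download_images(html, page_url, out_dir="images"):
    """Fetch every image referenced by the page into out_dir (hypothetical helper)."""
    os.makedirs(out_dir, exist_ok=True)
    for url in extract_image_urls(html, page_url):
        # use the last path segment as the file name, ignoring any query string
        filename = os.path.basename(url.split("?")[0]) or "unnamed"
        urlretrieve(url, os.path.join(out_dir, filename))
```

You would call download_images(res.text, res.url) after each requests.get. Note that the saved HTML still points at the original src values, so to view the pages offline you would also have to rewrite those attributes to the local file names.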
I would like to download the videos embedded in a set of URLs from the Google Ad Transparency Report. Here's a sample page where the video is a YouTube link:
https://transparencyreport.google.com/political-ads/advertiser/AR113835462480625664/creative/CR525481620803682304
And here's a sample page where the video is hosted by Google and the video file can be downloaded directly:
https://transparencyreport.google.com/political-ads/advertiser/AR528016269983612928/creative/CR221377870159675392
In both cases, regular browsers (Chrome, Firefox) let me copy the YouTube link URL (top example) or download the linked video file (bottom example).
However, I cannot locate these links in the page source. Can anyone tell me how to locate them, or how one would write a script that would locate the correct tags and grab the video files (or YouTube links)? Is this a dynamic content problem?
You can reach it through the transparencyreport API.
For example, the page you linked gets its video content from the following endpoint:
https://transparencyreport.google.com/transparencyreport/api/v3/politicalads/creatives/details?entity_id=AR528016269983612928&creative_id=CR221377870159675392
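To apply this to a whole set of pages, the API URL can be derived from the page URL, since both carry the same advertiser (AR...) and creative (CR...) IDs. Below is a minimal sketch, assuming every page URL follows the /advertiser/AR.../creative/CR... pattern of the two samples. The function name details_api_url is hypothetical, and the endpoint is simply the one quoted above, not a documented API, so it may change. The shape of the response is not shown here; inspect it to find the video or YouTube link.

```python
import re
from urllib.parse import urlencode

# Endpoint taken from the answer above (undocumented, may change)
API_BASE = ("https://transparencyreport.google.com/transparencyreport"
            "/api/v3/politicalads/creatives/details")


def details_api_url(page_url):
    """Build the details API URL for a creative page URL."""
    match = re.search(r"/advertiser/(AR\d+)/creative/(CR\d+)", page_url)
    if match is None:
        raise ValueError("not a creative page URL: " + page_url)
    entity_id, creative_id = match.groups()
    query = urlencode({"entity_id": entity_id, "creative_id": creative_id})
    return API_BASE + "?" + query
```

You could then pass the result to requests.get and look through the response text for the video URL.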
I want to extract YouTube search results for a query and display them to the user.
So far, I have fetched the YouTube link and extracted the title from it.
However, I also want to display the thumbnail for that link, the same one shown in YouTube's suggestions section.
For a question like this, I'd recommend running a Google Images search with site:youtube.com and looking at the URLs of one or two thumbnails. I believe the pattern below should work in most cases, though you'd need to test it on different types of videos (the maxresdefault size in particular is not available for every video).
If the video URL is https://www.youtube.com/watch?v=xxxxxxxxxxxx
The thumbnail URL is https://i.ytimg.com/vi/xxxxxxxxxxxx/maxresdefault.jpg
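In Python, the mapping above could be sketched like this. The function name thumbnail_url and its size parameter are made up for this example, and it only handles URLs of the watch?v=... form shown above.

```python
from urllib.parse import urlparse, parse_qs


def thumbnail_url(video_url, size="maxresdefault"):
    """Build the thumbnail URL for a https://www.youtube.com/watch?v=... link."""
    query = parse_qs(urlparse(video_url).query)
    video_ids = query.get("v")
    if not video_ids:
        raise ValueError("no v= parameter in " + video_url)
    # other query parameters (list=, index=, ...) are ignored
    return f"https://i.ytimg.com/vi/{video_ids[0]}/{size}.jpg"
```

If the request for the maxresdefault image returns a 404, retrying with a smaller size name such as hqdefault is worth a try; treat that name as an assumption to verify, like the pattern itself.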
(Complete beginner in web scraping here)
I'm trying to scrape the PDF from this webpage using Python:
http://pesquisa.in.gov.br/imprensa/jsp/visualiza/index.jsp?jornal=3&pagina=1&data=31/03/1993
The problem is that the above URL points to the viewer (with date and page parameters), not to the PDF file. I inspected the HTML to find a direct URL to the PDF, but could not find one.
Any help on how to find the correct URL and implement a way to download the files in Python?
Edit:
I will later generalize this to other days and pages. The full list of day-page links can be found by searching for the relevant period here: http://portal.imprensanacional.gov.br/