Automate downloading images off Google - python

I'm very new to Python and I'm trying to create a tool that automates downloading images off Google.
So far, I have the following code:
import urllib
def google_image(x):
    search = x.split()
    search = '%20'.join(map(str, search))
    url = 'http://ajax.googleapis.com/ajax/services/search/images?v=1.0&q=%s&safe=off' %
But I'm not sure where to continue or if I'm even on the right track. Can someone please help?
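As a hedged sketch of the URL-building step only: the standard library can percent-encode the search term, so the manual split/join is not needed. The function name build_search_url is made up for illustration, and the endpoint is simply the one from the question; whether it still responds is not something this sketch checks.

```python
from urllib.parse import quote, urlencode

# Endpoint copied from the question; treat its availability as an assumption.
BASE_URL = "http://ajax.googleapis.com/ajax/services/search/images"

def build_search_url(query):
    """Return the search URL with the query safely percent-encoded."""
    params = {"v": "1.0", "q": query, "safe": "off"}
    # quote_via=quote encodes spaces as %20 (the default would use '+').
    return BASE_URL + "?" + urlencode(params, quote_via=quote)
```

Calling build_search_url("cute dog") gives a URL whose query part contains q=cute%20dog, and characters such as & in the search term are escaped as well.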

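See the Scrapy documentation for the images pipeline. It is enabled in the project settings (in Scrapy 1.0 and later the class lives at scrapy.pipelines.images.ImagesPipeline; the path below is the older one):
ITEM_PIPELINES = {'scrapy.contrib.pipeline.images.ImagesPipeline': 1}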

Related

Generalised Scraping embed YouTube videos with python

I just wrote a script that scrapes all YouTube links in a given page (this one: https://coreyms.com/) and opens them in a browser (this has no particular use besides learning how to do it).
from requests_html import HTML, HTMLSession
import re
import webbrowser
url = input("website url to launch all embedded videos")
if url == '':
    url = 'https://coreyms.com/'
sess = HTMLSession()
r = sess.get(url)
whole_file = r.text
pattern = re.compile(r'https:\/\/www\.youtube\.com\/embed\/(.+)[\?"]')
video_ids = pattern.findall(whole_file)
The problem is that it only works with that specific website (because I used its HTML source to work out what my script must search for).
I managed to make the script work with this one as well:
https://www.udiscovermusic.com/stories/best-heavy-metal-songs/
But with another one:
https://www.musicgrotto.com/best-classic-rock-songs/
it does not work again. I think this is not a good approach.
How can I generalise my script so that it works with any website?
Is there a common thread I could use?
I don't know much about HTML and web development (that's exactly why I want to do this).
Thanks
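One possible direction, as a sketch rather than a definitive answer: the pattern in the question uses a greedy (.+) and only accepts https://www.youtube.com, so small differences in markup break it. The version below accepts an optional scheme, an optional www., and the youtube-nocookie.com host, and it assumes video IDs are 11 characters drawn from letters, digits, - and _ (a common convention, not a guarantee). The name find_video_ids is hypothetical.

```python
import re

# Assumption: embed URLs look like //[www.]youtube[-nocookie].com/embed/<11-char id>
EMBED_PATTERN = re.compile(
    r'(?:https?:)?//(?:www\.)?youtube(?:-nocookie)?\.com/embed/([A-Za-z0-9_-]{11})'
)

def find_video_ids(html):
    """Return the embedded video IDs found in html, in order, without duplicates."""
    seen = []
    for video_id in EMBED_PATTERN.findall(html):
        if video_id not in seen:
            seen.append(video_id)
    return seen
```

With the script above, this would be called as find_video_ids(r.text). Pages that insert their iframes with JavaScript after loading would still return nothing, because the downloaded HTML does not contain the embed URLs.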

Saving image from API Endpoint with no filetype, in python

I'm trying to save images from the Spotify API.
I get album art in the form of a link:
https://i.scdn.co/image/ab67616d00004851c96f7c7b077c224975b4c5ce
I think it's a JPEG file.
I run into errors when trying to display or save it in Python.
I'm not even sure how I'm meant to format the link.
Do I need str around it?
str(https://i.scdn.co/image/ab67616d00004851c96f7c7b077c224975b4c5ce)
Or should I create a new variable e.g.
image_path = 'https://i.scdn.co/image/ab67616d00004851c96f7c7b077c224975b4c5ce'
And then:
im1 = im1.save(image_path)
Your second suggestion should work once you add a step that actually downloads the image, using urllib.request:
import urllib.request
image_path = 'https://i.scdn.co/image/ab67616d00004851c96f7c7b077c224975b4c5ce'
urllib.request.urlretrieve(image_path, "image.jpg")
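If you would rather not guess that the file is a JPEG, one option is to look at the first bytes of the downloaded data and pick the extension from them. The sketch below is only an illustration: the helper names guess_extension and save_image are made up, and the signature table covers just a few common formats.

```python
import urllib.request

# Leading "magic" bytes of a few common image formats.
SIGNATURES = [
    (b"\xff\xd8\xff", ".jpg"),
    (b"\x89PNG\r\n\x1a\n", ".png"),
    (b"GIF87a", ".gif"),
    (b"GIF89a", ".gif"),
]

def guess_extension(data, default=".bin"):
    """Return a file extension based on the first bytes of data."""
    for signature, extension in SIGNATURES:
        if data.startswith(signature):
            return extension
    return default

def save_image(url, stem):
    """Download url and save it as stem plus the guessed extension."""
    with urllib.request.urlopen(url) as response:
        data = response.read()
    filename = stem + guess_extension(data)
    with open(filename, "wb") as handle:
        handle.write(data)
    return filename
```

For the link above this would be used as save_image(image_path, "cover"), which returns the name of the file it wrote.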

Getting a URL of some picture from Google search

New to Python. I'm trying to find a way to get the URL of the first picture returned by a Google search for some string. For example, if I type "dog" I would like to get the first picture URL for dog. I don't care which one, just some URL from Google image search.
Is it possible? What is the easiest way to do it? I saw many ways to extract/download the image in previous threads, but I just need the URL and it doesn't matter which one.
This should work; simply replace the word to get image URLs for anything.
Make sure you have requests and BeautifulSoup installed. If not, run this command:
pip install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup
word = 'dog'
url = 'https://www.google.com/search?q={0}&tbm=isch'.format(word)
content = requests.get(url).content
# 'html.parser' ships with Python; 'lxml' would need a separate install
soup = BeautifulSoup(content, 'html.parser')
images = soup.find_all('img')
for image in images:
    print(image.get('src'))
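Since the question asks for a single URL, here is a small sketch of picking the first usable one from the downloaded HTML. It uses only the standard library, and it assumes that relative paths and data: URIs (which pages often use for logos and placeholders) should be skipped; Google's actual markup is not guaranteed to stay the same. The names ImgSrcCollector and first_image_url are made up for this example.

```python
from html.parser import HTMLParser

class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)

def first_image_url(html):
    """Return the first absolute http(s) image URL in html, or None."""
    collector = ImgSrcCollector()
    collector.feed(html)
    for src in collector.sources:
        if src.startswith(("http://", "https://")):
            return src
    return None
```

It would be called on the text of the response, for example first_image_url(requests.get(url).text).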
I don't know about Google, but I do know an easy way to do this with Bing. There's a PyPI module called bing-image-urls (https://pypi.org/project/bing-image-urls), and it will do the job nicely. It's pretty easy to use. Just install it with:
pip install bing-image-urls
Then, in your Python script, use this code:
from bing_image_urls import bing_image_urls
url = bing_image_urls("dog", limit=1)[0]
print(url)
Just replace "dog" in this example with whatever you want.
Hopefully this answers your question
Thanks!

python newspaper module - get all the images from an article

Using Python's newspaper module, I can get the top image from an article in the following way:
from newspaper import Article
first_article = Article(url="http://www.lemonde.fr/...", language='fr')
first_article.download()
first_article.parse()
print(first_article.top_image)
But I need to get all the images in the article. Their GitHub documentation says 'All image extraction from html' is possible, but I can't figure out how. I also do not want to manually download the HTML files, save them to disk, and then feed them to the module to get the images.
How can I achieve that?
You likely solved this already, but you can obtain the image URLs with Newspaper by reading the article.images attribute after parsing.
from newspaper import Article
article = Article(url="http://www.lemonde.fr/", language='fr')
article.download()
article.parse()
top_image = article.top_image
all_images = article.images
for image in all_images:
    print(image)
https://img.lemde.fr/2020/09/22/0/3/4485/2990/220/146/30/0/a79897c_115736902-000-8pt8nc.jpg
https://img.lemde.fr/2020/09/22/0/0/5315/3543/192/0/75/0/7b90c88_645792534-pns-3418491.jpg
https://img.lemde.fr/2020/09/09/200/0/1500/999/180/0/95/0/d8099d2_51464-3185927.jpg
https://img.lemde.fr/2020/09/22/0/4/4248/2832/664/442/60/0/557e6ee_5375150-01-06.jpg

Beautiful Soup can not find all image tags in html (stops exactly at 5)

I am trying to use BeautifulSoup to get all the images on a site that have a certain class. My issue is that when I run the code just to see whether it can find each image, it only gets images 1-5. I think the issue is the HTML, since images 6 onward are located in a nested div, but find_all should be able to find every img with the same class.
import requests, os, bs4, sys, webbrowser
url = 'https://mangapanda.onl/chapter/'
os.makedirs('manga', exist_ok=True)
comic = sys.argv[1:]
aComic = '-'.join(sys.argv[1:])
issue = input('which issue do you want?')
aIssue = ('/chapter-' + issue)
aComic = (aComic + '_110' + aIssue)
comicUrl = (url + aComic)
res = requests.get(comicUrl)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
comicElem = soup.find_all(class_="PB0mN")
if comicElem == []:
    print('nothing in the list')
else:
    print('There are ' + str(len(comicElem)) + ' on this page')
    for i in range(len(comicElem)):
        comicPage = comicElem[i].get('src')
        print(str(comicPage) + '\n')
Is there something I am missing about BeautifulSoup that could have helped me solve this? Is it the HTML that is causing the problem? Was there a better way I could have diagnosed this myself that would have been within my capability? (Side note: I am currently going through the book "Automate the Boring Stuff with Python". It is where I got the idea for this mini project, and it gives a decent idea of my skill level with Python.) Lastly, I am using BeautifulSoup to learn more about it. If possible I would like to solve this issue using BeautifulSoup, but I will research other options for parsing HTML if I need to.
Using Firefox Quantum 59.0.2
Using Python 3
PS: if you know of other questions that may have answered this problem already, feel free to just link me to them. I really wanted to figure out the answer through someone else's questions, but it seems like my issue is pretty unique.
The problem is that some of the images are added to the DOM by JavaScript after the page has loaded. So
res = requests.get(comicUrl)
gets the HTML before any modifications are made by JavaScript. This is why
soup = bs4.BeautifulSoup(res.text, 'html.parser')
comicElem = soup.find_all(class_="PB0mN")
len(comicElem) # = 5
only finds 5 images.
If you want all the images that end up on the page, the requests library alone is not enough, because it does not run JavaScript. Here is an example using selenium:
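from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
# Selenium 4 style; adjust the chromedriver path for your machine
browser = webdriver.Chrome(service=Service('/Users/glenn/Downloads/chromedriver'))
comicUrl = "https://mangapanda.onl/chapter/naruto_107/chapter-700.5"
browser.get(comicUrl)
# find_elements_by_class_name was removed in Selenium 4.3; use find_elements with By
images = browser.find_elements(By.CLASS_NAME, "PB0mN")
for image in images:
    print(image.get_attribute('src'))
len(images) # = 18 images
browser.quit()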
See this post for additional resources for scraping javascript pages:
Web-scraping JavaScript page with Python
As for how to tell whether the HTML is being modified by JavaScript:
I don't have any hard rules, but these are some investigative steps you can carry out.
The first clue is what you already observed: requests finds only 5 images, yet the page in the browser shows more, so the DOM is being changed after it is loaded.
A second step: in the browser Developer Tools -> Scripts you can see there are several JavaScript files associated with the page. Note that not all JavaScript modifies the DOM, so the presence of these scripts does not necessarily mean they are modifying it.
To verify further that the DOM is being modified after the page is loaded:
Copy the html from Developer Tools -> View Page Source into an HTML formatter tool like http://htmlformatter.com, format the html and look at the line count. The Developer Tools -> View Page Source is the html that is sent by the server without any modifications.
Then copy the html from Developer Tools -> Elements (be sure to get the whole thing from <html>...</html>) and copy this into an HTML formatter tool like http://htmlformatter.com, format and look at the line count. The Developer Tools -> Elements html is the complete, modified DOM.
If the line counts are significantly different then you know the DOM is being modified after it is loaded.
Comparing line counts for "https://mangapanda.onl/chapter/naruto_107/chapter-700.5" shows 479 lines for the source html and 3245 lines for the complete DOM so you know something is modifying the DOM after the page is loaded.
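The same comparison can be roughed out in code once both versions are saved as text (the page source and the HTML copied from the Elements panel). This is a sketch of the idea, not the method used above: instead of line counts it counts occurrences of one tag, and the function names count_tags and looks_js_modified are made up for illustration.

```python
import re

def count_tags(html, tag):
    """Count opening tags named tag in html (case-insensitive)."""
    # \b stops <img from also matching a longer tag name that starts with img
    pattern = r"<%s\b" % re.escape(tag)
    return len(re.findall(pattern, html, flags=re.IGNORECASE))

def looks_js_modified(source_html, dom_html, tag="img"):
    """Return True if the rendered DOM has more tag elements than the raw source."""
    return count_tags(dom_html, tag) > count_tags(source_html, tag)
```

A True result for img on the two saved files would point the same way as the differing line counts: elements were added after the server's HTML arrived.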
