Getting a URL of some picture from Google search - python

I'm new to Python. I'm trying to find a way to get the URL of the first picture returned by a Google search for some string. For example, if I type "dog", I would like to get the URL of the first picture for dog. I don't care which one, just some URL from Google image search.
Is it possible? What is the easiest way to do it? I saw many ways to extract or download the image in previous threads, but I just need the URL, and it doesn't matter which one.

This should work; simply replace the word to get images of anything.
Make sure you have requests and BeautifulSoup installed. If not, run this command:
pip install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup
word = 'dog'
url = 'https://www.google.com/search?q={0}&tbm=isch'.format(word)
content = requests.get(url).content
# 'html.parser' ships with Python; the 'lxml' parser would need a separate install
soup = BeautifulSoup(content, 'html.parser')
images = soup.find_all('img')
for image in images:
    print(image.get('src'))
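Since the question only asks for a single URL, the loop above can stop at the first usable src. The following is a minimal sketch, assuming the result page still exposes plain img tags with absolute http(s) src values (Google's markup changes often, so treat this as an assumption). The helper name first_image_url is hypothetical, and the sketch swaps BeautifulSoup for the standard library's html.parser so that it runs without extra installs:

```python
from html.parser import HTMLParser


class _ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def first_image_url(html):
    """Return the first <img> src that is an absolute http(s) URL, or None."""
    collector = _ImgSrcCollector()
    collector.feed(html)
    for src in collector.sources:
        # Skip relative paths and inline data: URIs (logos, placeholders).
        if src.startswith(("http://", "https://")):
            return src
    return None


# Hypothetical sample markup, standing in for a downloaded result page.
sample = '<img src="/logo.gif"><img src="https://example.com/dog.jpg">'
print(first_image_url(sample))
```

With the answer's code, this would be called as first_image_url(requests.get(url).text).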

I don't know about Google, but I do know an easy way to do this with Bing. There's a PyPI package called bing-image-urls (https://pypi.org/project/bing-image-urls) that does the job nicely and is easy to use. Install it with:
pip install bing-image-urls
Then, in your Python script, use this code:
from bing_image_urls import bing_image_urls
url = bing_image_urls("dog", limit=1)[0]
print(url)
Just replace "dog" in this example with whatever you want.
Hopefully this answers your question
Thanks!

Related

Generalised Scraping embed YouTube videos with python

I just wrote a script that scrapes all the YouTube links on a given page (this one: https://coreyms.com/) and opens them in a browser (this has no particular use besides learning how to do it).
from requests_html import HTML, HTMLSession
import re
import webbrowser
url = input("website url to launch all embeded videos")
if url == '':
    url = 'https://coreyms.com/'
sess = HTMLSession()
r = sess.get(url)
whole_file = r.text
pattern = re.compile(r'https:\/\/www\.youtube\.com\/embed\/(.+)[\?"]')
video_ids = pattern.findall(whole_file)
The problem is that it only works with that specific website (because I used its HTML source to work out what my script must search for).
I managed to make the script work with this one as well:
https://www.udiscovermusic.com/stories/best-heavy-metal-songs/
But with another one:
https://www.musicgrotto.com/best-classic-rock-songs/
it fails again. I don't think this is a good approach.
How can I generalise my script so that it works with any website?
Is there a common thread I could use?
I don't know much about HTML and web development (that's exactly why I want to do this).
Thanks
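One possible direction, which is not from the original thread, is to read the iframe tags with an HTML parser and apply a narrower pattern to their URLs. The greedy (.+) in the pattern above can run past the video ID to the last ? or " on the line, which may be part of why the script is fragile. The sketch below assumes that embeds appear as iframe elements whose src (or data-src, which some lazy-loading setups use) points at a /embed/ URL. The function name youtube_embed_ids is hypothetical, and embeds injected later by JavaScript would not appear in r.text at all:

```python
import re
from html.parser import HTMLParser

# Matches the ID segment of an embed URL on either YouTube embed host.
EMBED_RE = re.compile(
    r"(?:https?:)?//(?:www\.)?youtube(?:-nocookie)?\.com/embed/([\w-]+)"
)


class _IframeCollector(HTMLParser):
    """Collects candidate URLs from every <iframe> tag."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "iframe":
            return
        for name, value in attrs:
            # Assumption: lazy loaders may move the URL from src to data-src.
            if name in ("src", "data-src") and value:
                self.urls.append(value)


def youtube_embed_ids(html):
    """Return the distinct video IDs found in iframe embeds, in page order."""
    collector = _IframeCollector()
    collector.feed(html)
    ids = []
    for url in collector.urls:
        match = EMBED_RE.search(url)
        if match and match.group(1) not in ids:
            ids.append(match.group(1))
    return ids
```

In the script above this would replace the pattern.findall call, for example video_ids = youtube_embed_ids(r.text).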

Beautiful Soup can not find all image tags in html (stops exactly at 5)

I am trying to use BeautifulSoup to get all the images on a site that have a certain class. My issue is that when I run the code just to see whether it can find each image, it only gets images 1-5. I think the issue is the HTML, since images 6 onward are located in a nested div, but find_all should be able to find every img with the same class.
import requests, os, bs4, sys, webbrowser
url = 'https://mangapanda.onl/chapter/'
os.makedirs('manga', exist_ok=True)
comic = sys.argv[1:]
aComic = '-'.join(sys.argv[1:])
issue = input('which issue do you want?')
aIssue = ('/chapter-' + issue)
aComic = (aComic + '_110' + aIssue)
comicUrl = (url + aComic)
res = requests.get(comicUrl)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
comicElem = soup.find_all(class_="PB0mN")
if comicElem == []:
    print('nothing in the list')
else:
    print('There are ' + str(len(comicElem)) + ' on this page')
    for i in range(len(comicElem)):
        comicPage = comicElem[i].get('src')
        print(str(comicPage) + '\n')
Is there something I am missing about BeautifulSoup that could have helped me solve this? Is it the HTML that is causing the problem? Was there a better way I could have diagnosed this myself that would have been within my capability? (Side note: I am currently going through the book "Automate the Boring Stuff with Python". It is where I got the idea for this mini project, and it gives a decent idea of my skill level with Python.) Lastly, I am using BeautifulSoup to learn more about it. If possible, I would like to solve this issue using BeautifulSoup, but I will research other options for parsing HTML if I need to.
Using Firefox Quantum 59.0.2
Using Python 3
PS: if you know of other questions that may have already answered this problem, feel free to just link me to them. I really wanted to figure out the answer from someone else's question, but it seems my issue is pretty unique.
The problem is some of the images are being added to the DOM via Javascript after the page is loaded. So
res = requests.get(comicUrl)
gets the HTML as sent by the server, before JavaScript makes any modifications to the DOM. This is why
soup = bs4.BeautifulSoup(res.text, 'html.parser')
comicElem = soup.find_all(class_="PB0mN")
len(comicElem) # = 5
only finds 5 images.
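If you want all the images that end up on the page, requests alone is not enough, because it does not execute JavaScript. Here is an example using Selenium:
from selenium import webdriver
from selenium.webdriver.common.by import By
# Selenium 4.6+ locates a matching driver itself; older versions need the path to chromedriver
browser = webdriver.Chrome()
comicUrl = "https://mangapanda.onl/chapter/naruto_107/chapter-700.5"
browser.get(comicUrl)
# find_elements(By.CLASS_NAME, ...) replaces find_elements_by_class_name, which was removed in Selenium 4.3
images = browser.find_elements(By.CLASS_NAME, "PB0mN")
for image in images:
    print(image.get_attribute('src'))
len(images) # = 18 images
browser.quit()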
See this post for additional resources for scraping javascript pages:
Web-scraping JavaScript page with Python
How can you tell whether the HTML is being modified by JavaScript?
I don't have any hard rules, but these are some investigative steps you can carry out:
The first clue is what you observed: requests finds only 5 images, yet the page in the browser shows more, which suggests the DOM is being changed after it is loaded.
A second step: in the browser's Developer Tools -> Scripts you can see that several JavaScript files are associated with the page. Not all JavaScript modifies the DOM, so the presence of these scripts does not by itself mean the DOM is being modified.
For further verification the DOM is being modified after the page is loaded:
Copy the html from Developer Tools -> View Page Source into an HTML formatter tool like http://htmlformatter.com, format the html and look at the line count. The Developer Tools -> View Page Source is the html that is sent by the server without any modifications.
Then copy the html from Developer Tools -> Elements (be sure to get the whole thing from <html>...</html>) and copy this into an HTML formatter tool like http://htmlformatter.com, format and look at the line count. The Developer Tools -> Elements html is the complete, modified DOM.
If the line counts are significantly different then you know the DOM is being modified after it is loaded.
Comparing line counts for "https://mangapanda.onl/chapter/naruto_107/chapter-700.5" shows 479 lines for the source html and 3245 lines for the complete DOM so you know something is modifying the DOM after the page is loaded.
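The same comparison can be approximated in code by counting the elements of interest in the server HTML and in the rendered DOM. This is only a sketch under stated assumptions: the helper names count_with_class and added_by_javascript are hypothetical, the two HTML strings are assumed to come from requests (res.text) and from Selenium (browser.page_source), and the standard library's html.parser is used so the example runs without extra installs:

```python
from html.parser import HTMLParser


class _ClassCounter(HTMLParser):
    """Counts start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # The class attribute is a space-separated list of class names.
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_with_class(html, class_name):
    """Return how many elements in html carry class_name."""
    counter = _ClassCounter(class_name)
    counter.feed(html)
    return counter.count


def added_by_javascript(static_html, rendered_html, class_name):
    """Return how many more matching elements the rendered DOM has."""
    return (count_with_class(rendered_html, class_name)
            - count_with_class(static_html, class_name))
```

A result greater than zero for added_by_javascript(res.text, browser.page_source, "PB0mN") would point to elements being added after the page is loaded.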

How to Get image when dynamic link comes from a website

I want to get the full-resolution image displayed on this website:
http://oiswww.eumetsat.org/IPPS/html/MSG/PRODUCTS/MPE/FULLRESOLUTION/index.htm
The image gets a new dynamic link every time it is updated, which causes a problem if we want to download it each time.
Do you have any tricks in Python to systematically download the full-resolution image?
Thanks all.
You can use BeautifulSoup, lxml or a Python regular expression to parse the HTML and get the correct link; there should be an XPath to it.
From the source code of the html:
array_nom_imagen[0]="wwCzemwbmWlTk"
array_nom_imagen[1]="CtXqGo6wG8hVz"
array_nom_imagen[2]="8UFuyfrkbcd0b"
...
...
array_nom_imagen[138]="fFoSqmGjl6zhJ"
array_nom_imagen[139]="S5QefAKEdpWQf"
array_nom_imagen[140]="vCcabHqeoVgdv"
and
function loadimages(i_image) {
    array_imagen[i_image] = new Image()
    array_imagen[i_image].src = "IMAGESDisplay/"+array_nom_imagen[i_image]
    imageurl[i_image]="IMAGESDisplay/"+array_nom_imagen[i_image]
    loaded_images[i_image]="TRUE"
}
So only 141 pictures are available.
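Based on the page source quoted above, the image names can be pulled out with a regular expression and joined with the "IMAGESDisplay/" prefix. This is a rough sketch, assuming the page keeps the array_nom_imagen[...]="..." format shown here. The function name image_urls is hypothetical, and the thread does not say which index holds the newest image, so the sketch returns all of them in index order:

```python
import re
from urllib.parse import urljoin

PAGE_URL = ("http://oiswww.eumetsat.org/IPPS/html/MSG/PRODUCTS/MPE/"
            "FULLRESOLUTION/index.htm")

# Captures the index and the quoted name from lines such as
# array_nom_imagen[0]="wwCzemwbmWlTk"
NAME_RE = re.compile(r'array_nom_imagen\[(\d+)\]\s*=\s*"([^"]+)"')


def image_urls(page_source, page_url=PAGE_URL):
    """Return absolute image URLs, ordered by their array index."""
    entries = sorted((int(index), name)
                     for index, name in NAME_RE.findall(page_source))
    # The page's script builds each link as "IMAGESDisplay/" + name,
    # relative to the page itself.
    return [urljoin(page_url, "IMAGESDisplay/" + name) for _, name in entries]
```

The page source would come from something like requests.get(PAGE_URL).text, and each returned URL can then be downloaded in turn.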

Handling image from Google Maps APIs (staticmap)

I am trying to get an image from the Google Maps APIs, more precisely from the staticmap API.
The problem is that with other Google Maps APIs you can choose whether you want the response in JSON or XML, but with staticmap (which returns an image) it seems you can't.
So I don't know how to handle the image provided by the URL, since I don't know how it is encoded.
This is what I'm trying to do:
import requests
url = ("https://maps.googleapis.com/maps/api/staticmap?size=400x400&path=weight:3%7Ccolor:orange%7Cenc:polyline_data")
response = requests.get(url)
print(response.json())
Since the response is not JSON, it raises the following error:
ValueError: Expecting value: line 1 column 1 (char 0)
I hope you have some advice on how to turn the response into something usable.
You are overthinking it.
staticmap (which returns an image)
Yes, you are right, so all you have to do is put the URL in <img src="here"/>.
The following is a demo of it. I used the example from the documentation.
<img src="https://maps.googleapis.com/maps/api/staticmap?size=400x400&path=weight:3%7Ccolor:orange%7Cenc:_fisIp~u%7CU}%7Ca#pytA_~b#hhCyhS~hResU%7C%7Cx#oig#rwg#amUfbjA}f[roaAynd#%7CvXxiAt{ZwdUfbjAewYrqGchH~vXkqnAria#c_o#inc#k{g#i`]o%7CF}vXaj\h`]ovs#?yi_#rcAgtO%7Cj_AyaJren#nzQrst#zuYh`]v%7CGbldEuzd#%7C%7Cx#spD%7CtrAzwP%7Cd_#yiB~vXmlWhdPez\_{Km_`#~re#ew^rcAeu_#zhyByjPrst#ttGren#aeNhoFemKrvdAuvVidPwbVr~j#or#f_z#ftHr{ZlwBrvdAmtHrmT{rOt{Zz}E%7Cc%7C#o%7CLpn~AgfRpxqBfoVz_iAocAhrVjr#rh~#jzKhjp#``NrfQpcHrb^k%7CDh_z#nwB%7Ckb#a{R%7Cyh#uyZ%7CllByuZpzw#wbd#rh~#%7C%7CFhqs#teTztrAupHhyY}t]huf#e%7CFria#o}GfezAkdW%7C}[ocMt_Neq#ren#e~Ika#pgE%7Ci%7CAfiQ%7C`l#uoJrvdAgq#fppAsjGhg`#%7ChQpg{Ai_V%7C%7Cx#mkHhyYsdP%7CxeA~gF%7C}[mv`#t_NitSfjp#c}Mhg`#sbChyYq}e#rwg#atFff}#ghN~zKybk#fl}A}cPftcAite#tmT__Lha#u~DrfQi}MhkSqyWivIumCria#ciO_tHifm#fl}A{rc#fbjAqvg#rrqAcjCf%7Ci#mqJtb^s%7C#fbjA{wDfs`BmvEfqs#umWt_Nwn^pen#qiBr`xAcvMr{Zidg#dtjDkbM%7Cd_#"/>
I was able to solve the problem. This is the code:
import requests
url = ("https://maps.googleapis.com/maps/api/staticmap?size=400x400&path=weight:3%7Ccolor:orange%7Cenc:polyline_data")
r = requests.get(url)
image = r.content  # raw response bytes (public attribute, rather than the private r._content)
with open("map.png", "wb") as file:  # writing the bytes creates a usable .png file
    file.write(image)
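Two details of the code above can be made less error-prone: building the query string and checking what kind of body came back before saving it. The following is a small sketch with hypothetical helper names (staticmap_url, is_image), not an official client. It assumes only the size and path parameters from the thread:

```python
from urllib.parse import urlencode

BASE_URL = "https://maps.googleapis.com/maps/api/staticmap"


def staticmap_url(size, path):
    """Build the request URL from unencoded parameter values."""
    # urlencode inserts the "&" separators and percent-encodes
    # characters such as "|" and ":" in the values.
    return BASE_URL + "?" + urlencode({"size": size, "path": path})


def is_image(content_type):
    """True when a Content-Type header value names an image type."""
    media_type = content_type.split(";")[0].strip().lower()
    return media_type.startswith("image/")


print(staticmap_url("400x400", "weight:3|color:orange|enc:polyline_data"))
```

With requests, the check could be used as is_image(r.headers.get("Content-Type", "")) before writing r.content to a file, so that an error message returned as text is not saved as map.png.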

Automate downloading images off Google

I'm very new to Python and I'm trying to create a tool that automates downloading images from Google.
So far, I have the following code:
import urllib
def google_image(x):
    search = x.split()
    search = '%20'.join(map(str, search))
    url = 'http://ajax.googleapis.com/ajax/services/search/images?v=1.0&q=%s&safe=off' % search
But I'm not sure how to continue or whether I'm even on the right track. Can someone please help?
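See the Scrapy documentation for the images pipeline:
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}  # 'scrapy.contrib.pipeline.images' in Scrapy versions before 1.0
IMAGES_STORE = '/path/to/images'  # placeholder directory; the pipeline needs a storage location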
