I'm writing a program that will scrape wind speed and direction data from Google. I've seen other examples online where it works fine, but for some reason it's not working for me. I am specifically interested in scraping the elements with "img" tags. Here is my code:
import requests
import bs4
import geocoder
lat, long = 40.776903698619975, -74.45007646247723
base_url = r"https://www.google.com/search?q="
geoc = geocoder.osm([lat, long], method='reverse').json["raw"]["address"]
search_query = geoc["state"] + " " + geoc["country"] + " wind conditions"
lowest_admin_levels = ("municipality", "town", "city", "county")
level_found = False
for level in lowest_admin_levels:
    try:
        search_query = geoc[level] + " " + search_query
        level_found = True
        break
    except KeyError:
        continue
url = base_url + search_query.replace(" ", "+")
print(url)
page = requests.get(url)
soup = bs4.BeautifulSoup(page.content, 'html.parser')
print(soup.find_all('img'))
The lat/long variables could be any coordinates; those are just examples. soup.find_all('img') returns just one "img" element, when in reality the page has multiple "img"s containing arrows rotated according to the wind direction, as you can see at this link: https://www.google.com/search?q=Morris+Township+New+Jersey+United+States+wind+conditions. Thank you!
As the comment already says, Google loads the images dynamically using JavaScript. The requests library and Beautiful Soup are not able to get those JavaScript-loaded images. That's why you need Selenium to get those images.
Installation
pip install selenium
pip install webdriver-manager
Solution
import geocoder
# New imports
import time
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
lat, long = 40.776903698619975, -74.45007646247723
BASE_URL = r"https://www.google.com/search?q="
geoc = geocoder.osm([lat, long], method='reverse').json["raw"]["address"]
search_query = geoc["state"] + " " + geoc["country"] + " wind conditions"
lowest_admin_levels = ("municipality", "town", "city", "county")
for level in lowest_admin_levels:
    try:
        search_query = geoc[level] + " " + search_query
        break
    except KeyError:
        continue
url = BASE_URL + search_query.replace(" ", "+")
chrome_options = Options()
# These options make the browser headless, so you don't see it;
# comment out these two lines to see what's happening
chrome_options.add_argument("--headless")
chrome_options.add_argument("--window-size=1920x1080")
driver = webdriver.Chrome(ChromeDriverManager().install(), options=chrome_options) # You could specify the path to the chrome driver instead
driver.get(url)
time.sleep(2)
imgs = driver.find_elements_by_tag_name('img') # all the image tags
for img in imgs:
    image_source = img.get_attribute('src')  # The src of the img tag
    print(image_source)
When you remove the headless option, you will see what Selenium “sees”. Using Selenium, you can also click around on the website and interact with it, just as a normal user would.
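For the original goal of reading the wind arrows, a minimal sketch along these lines could filter the images by their alt text and read the rotation from the inline style. The exact attribute contents ("km/h" in alt, a rotate(...) transform in style) are assumptions about Google's current markup and may change:
# Hedged sketch: assumes the wind arrows expose their direction in the "alt" text
# and their rotation as an inline "transform: rotate(...)" style.
for img in driver.find_elements_by_tag_name('img'):
    alt = img.get_attribute('alt') or ''
    style = img.get_attribute('style') or ''
    if 'km/h' in alt or 'mph' in alt:  # likely one of the wind arrows
        print(alt, '|', style)  # e.g. "8 km/h From northwest | ... rotate(408deg) ..."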
This doesn't require geocoder or selenium. Check out the SelectorGadget Chrome extension to visually grab CSS selectors by clicking on the desired element.
Also, you can get the wind direction from the same element, e.g. class='wob_t' -> aria-label:
<span class="wob_t" style="display:inline;text-align:right" aria-label="8 km/h From northwest Tuesday 10:00">8 km/h</span>
Which is the same as in the <img> element (look at alt):
<img src="//ssl.gstatic.com/m/images/weather/wind_unselected.svg" alt="8 km/h From northwest" style="transform-origin:50% 50%;transform:rotate(408deg);width:16px" aria-hidden="true" data-atf="1" data-frt="0" class="">
Code and a full example that scrapes more data in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
    "q": "london weather",
    "hl": "en",
    "gl": "us"
}
response = requests.get('https://www.google.com/search', headers=headers, params=params).text
soup = BeautifulSoup(response, 'lxml')
for weather_result in soup.select('.wob_noe .wob_hw'):
    try:
        wind_speed = weather_result.select_one('.wob_t').text
        '''
        extracts the element, splits the string by a SPACE, grabs the items at
        index 2 and 3, and then joins them with a SPACE. Or just use regex instead.
        Example:
        7 mph From northwest Sunday 9:00 AM ---> From northwest
        '''
        wind_direction = ' '.join(weather_result.select_one('.wob_t')['aria-label'].split(' ')[2:4])
        print(f"{wind_speed}\n{wind_direction}\n")
    except:
        pass  # or None instead
----------
'''
8 mph
From northeast
11 mph
From east
9 mph
From northeast
...
'''
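As the comment in the code above suggests, a regex could replace the slice-and-join step. A minimal sketch, assuming the aria-label keeps the "<speed> From <direction> ..." wording shown earlier:
import re

# Hedged sketch: pulls "From <direction>" out of the aria-label string,
# assuming the label keeps the wording shown above.
label = "8 km/h From northwest Tuesday 10:00"
match = re.search(r"From \w+", label)
if match:
    print(match.group(0))  # -> From northwest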
Alternatively, you can use Google Direct Answer Box API from SerpApi. It's a paid API with a free plan.
Essentially, you don't need to figure out the extraction part of the process; all that really needs to be done is to iterate over the structured JSON and use whatever you need from it (you also don't have to bypass blocks from Google or maintain the parser over time).
Code to integrate:
from serpapi import GoogleSearch
import os, json
params = {
    "engine": "google",
    "q": "london weather",
    "api_key": os.getenv("API_KEY"),
    "hl": "en",
    "gl": "us",
}
search = GoogleSearch(params)
results = search.get_dict()
forecast = results['answer_box']['forecast']
print(json.dumps(forecast, indent=2))
----------
'''
[
  {
    "day": "Tuesday",
    "weather": "Partly cloudy",
    "temperature": {
      "high": "72",
      "low": "57"
    },
    "thumbnail": "https://ssl.gstatic.com/onebox/weather/48/partly_cloudy.png"
  }
  ...
]
'''
Disclaimer, I work for SerpApi.
Related
I'm trying to scrape google for related searches when given a list of keywords, and then output these related searches into a csv file. My problem is getting beautiful soup to identify the related searches html tags.
Here is an example html tag in the source code:
<div data-ved="2ahUKEwitr8CPkLT3AhVRVsAKHVF-C80QmoICKAV6BAgEEBE">iphone xr</div>
Here are my webdriver settings:
from selenium import webdriver
user_agent = 'Chrome/100.0.4896.60'
webdriver_options = webdriver.ChromeOptions()
webdriver_options.add_argument('user-agent={0}'.format(user_agent))
capabilities = webdriver_options.to_capabilities()
capabilities["acceptSslCerts"] = True
capabilities["acceptInsecureCerts"] = True
Here is my code as is:
queries = ["iphone"]
driver = webdriver.Chrome(options=webdriver_options, desired_capabilities=capabilities, port=4444)
df2 = []
driver.get("https://google.com")
time.sleep(3)
driver.find_element(By.CSS_SELECTOR, "[aria-label='Agree to the use of cookies and other data for the purposes described']").click()
# get_current_related_searches
for query in queries:
    driver.get("https://google.com/search?q=" + query)
    time.sleep(3)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    p = soup.find_all('div data-ved')
    print(p)
    d = pd.DataFrame({'loop': 1, 'source': query, 'from': query, 'to': [s.text for s in p]})
    terms = d["to"]
    df2.append(d)
    time.sleep(3)
df = pd.concat(df2).reset_index(drop=False)
df.to_csv("related_searches.csv")
It's the p = soup.find_all(...) line that is incorrect; I'm just not sure how to get BeautifulSoup to identify these specific HTML tags. Any help would be great :)
@jakecohensol, as you've pointed out, the selector in p = soup.find_all is wrong. The correct CSS selector is .y6Uyqe .AB4Wff.
The Chrome/100.0.4896.60 User-Agent header is incomplete, and Google blocks requests with such an agent string. With a full User-Agent string, Google returns a proper HTML response.
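If you want to keep your Selenium setup, a minimal sketch of the fix is simply to pass a complete User-Agent string to the same ChromeOptions you already build (the string below is only an example):
from selenium import webdriver

# Hedged sketch: same setup as in the question, but with a complete
# User-Agent string instead of the bare "Chrome/100.0.4896.60".
user_agent = ("Mozilla/5.0 (X11; CrOS x86_64 14526.89.0) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/100.0.4896.133 Safari/537.36")
webdriver_options = webdriver.ChromeOptions()
webdriver_options.add_argument('user-agent={0}'.format(user_agent))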
Google Related Searches can be scraped without a browser. It will be faster and more reliable.
Here's your fixed code snippet (link to the full code in online IDE)
import time
import requests
from bs4 import BeautifulSoup
import pandas as pd
headers = {
    "User-Agent": "Mozilla/5.0 (X11; CrOS x86_64 14526.89.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.133 Safari/537.36"
}
queries = ["iphone", "pixel", "samsung"]
df2 = []
# get_current_related_searches
for query in queries:
    params = {"q": query}
    response = requests.get("https://google.com/search", params=params, headers=headers)
    soup = BeautifulSoup(response.text, "html.parser")
    p = soup.select(".y6Uyqe .AB4Wff")
    d = pd.DataFrame(
        {"loop": 1, "source": query, "from": query, "to": [s.text for s in p]}
    )
    terms = d["to"]
    df2.append(d)
    time.sleep(3)
df = pd.concat(df2).reset_index(drop=False)
df.to_csv("related_searches.csv")
Sample output:
,index,loop,source,from,to
0,0,1,iphone,iphone,iphone 13
1,1,1,iphone,iphone,iphone 12
2,2,1,iphone,iphone,iphone x
3,3,1,iphone,iphone,iphone 8
4,4,1,iphone,iphone,iphone 7
5,5,1,iphone,iphone,iphone xr
6,6,1,iphone,iphone,find my iphone
7,0,1,pixel,pixel,pixel 6
8,1,1,pixel,pixel,google pixel
9,2,1,pixel,pixel,pixel phone
10,3,1,pixel,pixel,pixel 6 pro
11,4,1,pixel,pixel,pixel 3
12,5,1,pixel,pixel,google pixel price
13,6,1,pixel,pixel,pixel 6 release date
14,0,1,samsung,samsung,samsung galaxy
15,1,1,samsung,samsung,samsung tv
16,2,1,samsung,samsung,samsung tablet
17,3,1,samsung,samsung,samsung account
18,4,1,samsung,samsung,samsung mobile
19,5,1,samsung,samsung,samsung store
20,6,1,samsung,samsung,samsung a21s
21,7,1,samsung,samsung,samsung login
Have a look at the SelectorGadget Chrome extension to grab a CSS selector by clicking on the desired element in your browser.
Check what your user agent is, or find multiple user agents for mobile, tablet, PC, or different OSes so you can rotate user agents, which reduces the chance of being blocked a little.
The ideal scenario is to combine rotating user agents with rotating proxies (ideally residential) and a CAPTCHA solver for the Google CAPTCHA that will appear eventually.
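A minimal sketch of user-agent rotation with requests, assuming you maintain your own small list of agent strings (the entries below are only placeholders; use real, current strings):
import random
import requests

# Hedged sketch: pick a random User-Agent per request to reduce the
# chance of being blocked. The list entries are placeholders.
user_agents = [
    "Mozilla/5.0 (X11; CrOS x86_64 14526.89.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.133 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36",
]

headers = {"User-Agent": random.choice(user_agents)}
response = requests.get("https://google.com/search", params={"q": "iphone"}, headers=headers)
print(response.status_code)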
As an alternative, there's a Google Search Engine Results API to scrape Google search results if you don't want to figure out how to create and maintain the parser from scratch, or how bypass blocks from Google (or other search engines).
Example code to integrate:
import os
import pandas as pd
from serpapi import GoogleSearch
queries = [
    'banana',
    'minecraft',
    'apple stock',
    'how to create a apple pie'
]
def serpapi_scrape_related_queries():
    related_searches = []

    for query in queries:
        print(f'extracting related queries from query: {query}')

        params = {
            'api_key': os.getenv('API_KEY'),  # your serpapi api key
            'device': 'desktop',              # device to retrieve results from
            'engine': 'google',               # serpapi parsing engine
            'q': query,                       # search query
            'gl': 'us',                       # country of the search
            'hl': 'en'                        # language of the search
        }

        search = GoogleSearch(params)  # where data extraction happens on the backend
        results = search.get_dict()   # JSON -> dict

        for result in results['related_searches']:
            query = result['query']
            link = result['link']

            related_searches.append({
                'query': query,
                'link': link
            })

    pd.DataFrame(data=related_searches).to_csv('serpapi_related_queries.csv', index=False)

serpapi_scrape_related_queries()
Part of the dataframe output:
query link
0 banana benefits https://www.google.com/search?gl=us&hl=en&q=Ba...
1 banana republic https://www.google.com/search?gl=us&hl=en&q=Ba...
2 banana tree https://www.google.com/search?gl=us&hl=en&q=Ba...
3 banana meaning https://www.google.com/search?gl=us&hl=en&q=Ba...
4 banana plant https://www.google.com/search?gl=us&hl=en&q=Ba...
I'm using beautiful soup to find the first hit from a google search.
Looking for "Stack Overflow" it should find https://www.stackoverflow.com
The code is mainly taken from here. However, it suddenly stopped working, with results[0] being index out of range.
print results[0]
IndexError: list index out of range
I suspect it's a cache problem as it was working fine and then stopped without changing the code. I've also rebooted and cleared the cache but still no results.
#!/usr/bin/python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
import webbrowser # for webrowser, duh!
import re
#------------------------------------------------
def write_it(s, f):
    # w for over write
    file = open(f, "w")
    file.write(s)
    file.close()
#------------------------------------------------
def URL_encode_space(s):
    return re.sub(r"\s", "%20", s)
#------------------------------------------------
def URL_decode_space(s):
    return re.sub(r"%20", " ", s)
#------------------------------------------------
urlBase = "https://google.com"
searchRequest = "Stack Overflow"
print searchRequest
searchRequest = URL_encode_space(searchRequest)
# String literal for HTML quote
q = "%22" # is a "
numOfResults = 10
myURL = urlBase + "/search?q=" + q + searchRequest + q + "&num={" + str(numOfResults) + "}"
page = requests.get(myURL)
soup = BeautifulSoup(page.text, "html.parser")
links = soup.findAll("a")
results = []
for link in links:
    link_href = link.get('href')
    if "url?q=" in link_href and not "webcache" in link_href:
        print (link.get('href').split("?q=")[1].split("&sa=U")[0])
        results.append(link.get('href').split("?q=")[1].split("&sa=U")[0])
print results[0]
# open web browser?
webbrowser.open(myURL)
I can obviously check the 'len(results)' to remove the error but that doesn't explain why it no longer works.
As people have said above, it isn't clear what could have caused the problem.
Make sure you're using a user-agent.
I took this code from my other answer (scraping headings, summary, and links from google search results).
Code and full example:
from bs4 import BeautifulSoup
import requests
import json
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
html = requests.get('https://www.google.com/search?q=java&oq=java',
                    headers=headers).text
soup = BeautifulSoup(html, 'lxml')
summary = []
for container in soup.findAll('div', class_='tF2Cxc'):
    heading = container.find('h3', class_='LC20lb DKV0Md').text
    article_summary = container.find('span', class_='aCOpRe').text
    link = container.find('a')['href']

    summary.append({
        'Heading': heading,
        'Article Summary': article_summary,
        'Link': link,
    })
print(json.dumps(summary, indent=2, ensure_ascii=False))
Alternatively, you can use Google Organic Results API from SerpApi to get these results.
It's a paid API with a free trial.
Part of JSON:
{
  "position": 1,
  "title": "Java | Oracle",
  "link": "https://www.java.com/",
  "displayed_link": "https://www.java.com",
  "snippet": "Java Download. » What is Java? » Need Help? » Uninstall. About Java. Go Java Java Training Java + Greenfoot Oracle Code One Oracle Academy for ..."
}
Code to integrate:
import os
from serpapi import GoogleSearch
params = {
    "engine": "google",
    "q": "stackoverflow",
    "api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results["organic_results"]:
    print(f"Link: {result['link']}")
Output:
Link: https://stackoverflow.com/
Link: https://en.wikipedia.org/wiki/Stack_Overflow
Link: https://stackoverflow.blog/
Link: https://stackoverflow.blog/podcast/
Link: https://www.linkedin.com/company/stack-overflow
Link: https://www.crunchbase.com/organization/stack-overflow
Disclaimer, I work for SerpApi.
I'm learning how to use BeautifulSoup and I'm trying to read the weather from Google. I'm using this URL.
I'm getting a 'KeyError: "id"' error on the line:
if span.attrs["id"] == "wob_tm":
What does this mean and how can I solve this problem?
I got the same error specifying a different attribute, "class", so I thought it might have just been a problem with the term "class", but I'm still receiving the error no matter what I use.
# Creates a list containing all appearences of the 'span' tag
# The weather value is located within a span tag
spans = soup.find_all("span")
for span in spans:
    if span.attrs["id"] == "wob_tm":
        print(span.content)
I expect the output to be the integer value of the weather but when I run the code I just get:
"KeyError: 'id'"
Some span tags don't have that attribute at all, so they give you the error when you try to access it. You could just refine your search:
spans = soup.find_all('span', {'id': 'wob_tm'})
This would find only objects that match. You can then just print them all:
for span in spans:
    print(span.content)
Although the rest of the answers are legit, none will work in this case because the temperature content is probably loaded using JavaScript, so the spans you're looking for won't be found. Instead you can use selenium, which works for sure, i.e.:
from selenium import webdriver
driver = webdriver.Chrome('chromedriver.exe')
driver.get("https://www.google.co.uk/search?sxsrf=ACYBGNSfZJRq-EqvQ7rSC0oFZW-FiL-S-Q%3A1571602469929&source=hp&ei=JcCsXb-ANoK4kwWgtK_4DQ&q=what%27s+the+weather+today&oq=whats+the+weather+&gs_l=psy-ab.3.0.0i10i70i256j0i10j0j0i10l3j0l3j0i10.663.2962..4144...0.0..0.82.1251.19......0....1..gws-wiz.....10..35i362i39j35i39j0i131.AWESAgn5njA")
temp = driver.find_element_by_id('wob_tm').text
print(temp)
The problem is that there is no 'id' key in the 'attrs' dictionary. The code below will handle this case.
spans = soup.find_all("span")
for span in spans:
    if span.attrs.get("id") == "wob_tm":
        print(span.content)
    else:
        print('not wob_tm')
Weather data is not rendered with JavaScript, contrary to what Kostas Charitidis mentioned.
You don't need to specify the <span> element, and moreover you don't need to use find_all()/findAll()/select(), since you're looking for just one element that doesn't repeat anywhere else. Use select_one() instead:
soup.select_one('#wob_tm').text
# prints temperature
You can also use try/except if you want to return None:
try:
    temperature = soup.select_one('#wob_tm').text
except:
    temperature = None
An if statement always costs you something, while it's nearly free to set up a try/except block; but when an exception actually occurs, the cost is much higher.
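For comparison, a minimal sketch of the if-based approach, which avoids the bare except and simply checks whether the element was found:
# Hedged sketch: check the result of select_one() before reading .text.
temperature_element = soup.select_one('#wob_tm')
temperature = temperature_element.text if temperature_element else None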
The next problem that might cause that error is having no user-agent specified: Google would eventually block your request, so you'd receive completely different HTML. I already answered a question about what a user-agent is.
Code and full example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
    "q": "london weather",
    "hl": "en",
    "gl": "us"
}
response = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(response.text, 'lxml')
weather_condition = soup.select_one('#wob_dc').text
temperature = soup.select_one('#wob_tm').text
precipitation = soup.select_one('#wob_pp').text
humidity = soup.select_one('#wob_hm').text
wind = soup.select_one('#wob_ws').text
current_time = soup.select_one('#wob_dts').text
print(f'Weather condition: {weather_condition}\n'
      f'Temperature: {temperature}°F\n'
      f'Precipitation: {precipitation}\n'
      f'Humidity: {humidity}\n'
      f'Wind speed: {wind}\n'
      f'Current time: {current_time}\n')
----
'''
Weather condition: Mostly cloudy
Temperature: 60°F
Precipitation: 3%
Humidity: 77%
Wind speed: 3 mph
Current time: Friday 7:00 AM
'''
Alternatively, you can achieve this by using the Google Direct Answer Box API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't have to figure out how to extract elements, since it's already done for the end user, and there's no need to maintain a parser over time. All that needs to be done is to iterate over the structured JSON and get what you were looking for.
Code to integrate:
from serpapi import GoogleSearch
import os
params = {
    "engine": "google",
    "q": "london weather",
    "api_key": os.getenv("API_KEY"),
    "hl": "en",
    "gl": "us",
}
search = GoogleSearch(params)
results = search.get_dict()
loc = results['answer_box']['location']
weather_date = results['answer_box']['date']
weather = results['answer_box']['weather']
temp = results['answer_box']['temperature']
precipitation = results['answer_box']['precipitation']
humidity = results['answer_box']['humidity']
wind = results['answer_box']['wind']
print(f'{loc}\n{weather_date}\n{weather}\n{temp}°F\n{precipitation}\n{humidity}\n{wind}\n')
-------
'''
District 3
Friday
Mostly sunny
80°F
0%
52%
5 mph
'''
Disclaimer, I work for SerpApi.
I am scraping Google Scholar and am having trouble getting the right XPath expression. When I inspect the wanted elements, the browser gives me expressions like these:
//*[@id="gs_res_ccl_mid"]/div[2]/div[2]/div[3]/a[3]
//*[@id="gs_res_ccl_mid"]/div[3]/div/div[3]/a[3]
//*[@id="gs_res_ccl_mid"]/div[6]/div[2]/div[3]/a[3]
I ended up with the generic expression:
//*[@id="gs_res_ccl_mid"]//a[3]
Also tried the alternative, with similar results:
//*[@id="gs_res_ccl_mid"]/div*/div*/div*/a[3]
The output is something like this (I cannot post the entire result set because I don't have 10 points of reputation):
[
    'https://scholar.google.es/scholar?cites=5812018205123467454&as_sdt=2005&sciodt=0,5&hl=es',
    'https://scholar.google.es/citations?user=EOc3O8AAAAAJ&hl=es&oi=sra',
    'https://scholar.google.es/citations?user=nd8O1XQAAAAJ&hl=es&oi=sra',
    'https://scholar.google.es/scholar?cites=15483392402856138853&as_sdt=2005&sciodt=0,5&hl=es',
    'https://scholar.google.es/scholar?cites=7733120668292842687&as_sdt=2005&sciodt=0,5&hl=es',
    'https://scholar.google.es/scholar?cites=15761030700327980189&as_sdt=2005&sciodt=0,5&hl=es'
]
The problem with the output is that there are 3 extra unwanted elements, and they all contain the text citations?user. What can I do to get rid of the unwanted elements?
My code:
def paperOthers(exp, atr=None):
    thread = browser.find_elements(By.XPATH, (" %s" % exp))
    xArray = []
    for t in thread:
        if atr == 0:
            xThread = t.get_attribute('id')
        elif atr == 1:
            xThread = t.get_attribute('href')
        else:
            xThread = t.text
        xArray.append(xThread)
    return xArray
Which I call with:
rcites = paperOthers("//*[@id='gs_res_ccl_mid']//a[3]", 1)
Change the XPath to exclude the items containing that text.
rcites = paperOthers("//*[@id='gs_res_ccl_mid']//a[3][not(contains(.,'citations?user'))]", 1)
The XPath expression could be as simple as //*[@class="gs_fl"]/a[3]/@href:
//* selects all elements in the document; the predicate that follows filters them.
[@class="gs_fl"] selects the element nodes whose class attribute is gs_fl.
/a[3] selects the third <a> element that is a child of the gs_fl element.
/@href selects the href attribute of that <a> element.
A w3schools XPath syntax reminder.
Code and full example in the online IDE:
from parsel import Selector
import requests
# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "biology",  # search query
    "hl": "en"       # language
}
# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
# https://www.whatismybrowser.com/detect/what-is-my-user-agent
# used to act as a "real" user visit
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36",
}
html = requests.get("https://scholar.google.com/scholar", params=params, headers=headers, timeout=30)
selector = Selector(html.text)
for cite_by in selector.xpath('//*[@class="gs_fl"]/a[3]/@href'):
    cited_by_link = f"https://scholar.google.com/{cite_by.get()}"
    print(cited_by_link)
# output:
"""
https://scholar.google.com//scholar?cites=775353062728716840&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com//scholar?cites=1275980731835430123&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com//scholar?cites=9861875288567469852&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com//scholar?cites=6048612362870884073&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com//scholar?cites=9716378516521733998&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com//scholar?cites=12429039222112550214&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com//scholar?cites=12009957625147018103&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com//scholar?cites=11605101213592406305&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com//scholar?cites=85936656034523965&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com//scholar?cites=3694569986105898338&as_sdt=2005&sciodt=0,5&hl=en
"""
Alternatively, you can achieve the same thing by using Google Scholar Organic Results API from SerpApi.
It's a paid API with a free plan that you can use without needing to figure out how to scrape the data and maintain the scraper over time, how to scale it without getting blocked by the search engine, or how to find reliable proxy providers or CAPTCHA-solving services.
Example code to integrate:
from serpapi import GoogleScholarSearch
import os
params = {
    "api_key": os.getenv("API_KEY"),  # SerpApi API key
    "engine": "google_scholar",       # scraping search engine
    "q": "biology",                   # search query
    "hl": "en"                        # language
}
search = GoogleScholarSearch(params)
results = search.get_dict()
for cited_by in results["organic_results"]:
    cited_by_link = cited_by["inline_links"]["cited_by"]["link"]
    print(cited_by_link)
# output:
"""
https://scholar.google.com/scholar?cites=775353062728716840&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=1275980731835430123&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=9861875288567469852&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=6048612362870884073&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=9716378516521733998&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=12429039222112550214&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=12009957625147018103&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=11605101213592406305&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=85936656034523965&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=3694569986105898338&as_sdt=2005&sciodt=0,5&hl=en
"""
Disclaimer, I work for SerpApi.
I get the following traceback:
Traceback (most recent call last):
  File "/home/ro/image_scrape_test.py", line 20, in <module>
    soup = BeautifulSoup(searched, "lxml")
  File "/usr/local/lib/python3.4/dist-packages/bs4/__init__.py", line 176, in __init__
    elif len(markup) <= 256:
TypeError: object of type 'NoneType' has no len()
This is my code so far:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import urllib
#searches google images
driver = webdriver.Firefox()
google_images = ("https://www.google.com/search?site=imghp&tbm=isch source=hp&biw=1366&bih=648&q=")
search_term = input("what is your search term")
searched = driver.get("{0}{1}".format(google_images, search_term))
def savepic(url):
    uri = ("/home/ro/image scrape/images/download.jpg")
    if url != "":
        urllib.urlretrieve(url, uri)
soup = BeautifulSoup(searched, "lxml")
soup1 = soup.content
images = soup1.find_all("a")
for image in images:
    savepic(image)
I'm starting out, so I'd appreciate any tips on how I can improve my code.
Thank you.
driver.get() loads a webpage in the browser and returns None, which leaves the searched variable with a None value.
You probably meant to get the .page_source instead:
soup = BeautifulSoup(driver.page_source, "lxml")
Two additional points here:
you don't actually need BeautifulSoup here - you can locate the desired images with selenium using, for instance, driver.find_elements_by_tag_name()
I have not tested your code, but I think you would need to add additional Explicit Waits to make selenium wait for the page to load; a minimal sketch combining both points follows.
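This sketch waits for at least one img element to be present before collecting them; the search URL and query are only examples and mirror the shape of the code above:
# Hedged sketch: locate images with selenium directly and wait for them
# to be present, instead of parsing the page source with BeautifulSoup.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("https://www.google.com/search?site=imghp&tbm=isch&q=cats")  # example URL
WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.TAG_NAME, "img"))
)
for image in driver.find_elements_by_tag_name("img"):
    print(image.get_attribute("src"))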
searched is None. Apparently, the url you are using is invalid.
You can scrape Google Images using only the beautifulsoup and requests libraries; selenium is not required.
For example, if you only want to extract thumbnail images (small resolution), you can pass a "content-type": "image/png" query param (solution found from MendelG) and it will return thumbnail image links.
import requests
from bs4 import BeautifulSoup
params = {
    "q": "batman wallpaper",
    "tbm": "isch",
    "content-type": "image/png",
}
html = requests.get("https://www.google.com/search", params=params)
soup = BeautifulSoup(html.text, 'html.parser')
for img in soup.select("img"):
    print(img["src"])
# https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQAxU74QyJ8jn8Qq0ZK3ur_GkxjICcvmiC30DWnk03DEsi7YUgS8XXksdyybXY&s
# https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRh5Fhah5gT9msG7vhXeQzAziS17Jp1HE_wE5O00113DtE2rJztgvxwRSonAno&s
# ...
To scrape the full-res image URL with requests and beautifulsoup you need to scrape data from the page source code via regex.
Find all <script> tags:
soup.select('script')
Match images data via regex:
matched_images_data = ''.join(re.findall(r"AF_initDataCallback\(([^<]+)\);", str(all_script_tags)))
Match desired images (full res size) via regex:
# https://kodlogs.com/34776/json-decoder-jsondecodeerror-expecting-property-name-enclosed-in-double-quotes
# if you try to json.loads() without json.dumps() it will throw an error:
# "Expecting property name enclosed in double quotes"
matched_images_data_fix = json.dumps(matched_images_data)
matched_images_data_json = json.loads(matched_images_data_fix)
matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]",
                                                   matched_images_data_json)
Extract and decode them using bytes() and decode():
for fixed_full_res_image in matched_google_full_resolution_images:
    original_size_img_not_fixed = bytes(fixed_full_res_image, 'ascii').decode('unicode-escape')
    original_size_img = bytes(original_size_img_not_fixed, 'ascii').decode('unicode-escape')
Code and full example in the online IDE that also downloads images to a folder:
import requests, lxml, re, json
from bs4 import BeautifulSoup
headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
    "q": "pexels cat",
    "tbm": "isch",
    "hl": "en",
    "ijn": "0",
}
html = requests.get("https://www.google.com/search", params=params, headers=headers)
soup = BeautifulSoup(html.text, 'lxml')
def get_images_data():

    print('\nGoogle Images Metadata:')
    for google_image in soup.select('.isv-r.PNCib.MSM1fd.BUooTd'):
        title = google_image.select_one('.VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb')['title']
        source = google_image.select_one('.fxgdke').text
        link = google_image.select_one('.VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb')['href']
        print(f'{title}\n{source}\n{link}\n')

    # these steps could be refactored into a more compact function
    all_script_tags = soup.select('script')

    # https://regex101.com/r/48UZhY/4
    matched_images_data = ''.join(re.findall(r"AF_initDataCallback\(([^<]+)\);", str(all_script_tags)))

    # https://kodlogs.com/34776/json-decoder-jsondecodeerror-expecting-property-name-enclosed-in-double-quotes
    # if you try to json.loads() without json.dumps() it will throw an error:
    # "Expecting property name enclosed in double quotes"
    matched_images_data_fix = json.dumps(matched_images_data)
    matched_images_data_json = json.loads(matched_images_data_fix)

    # https://regex101.com/r/pdZOnW/3
    matched_google_image_data = re.findall(r'\[\"GRID_STATE0\",null,\[\[1,\[0,\".*?\",(.*),\"All\",', matched_images_data_json)

    # https://regex101.com/r/NnRg27/1
    matched_google_images_thumbnails = ', '.join(
        re.findall(r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]',
                   str(matched_google_image_data))).split(', ')

    print('Google Image Thumbnails:')  # in order
    for fixed_google_image_thumbnail in matched_google_images_thumbnails:
        # https://stackoverflow.com/a/4004439/15164646 comment by Frédéric Hamidi
        google_image_thumbnail_not_fixed = bytes(fixed_google_image_thumbnail, 'ascii').decode('unicode-escape')
        # after the first decoding, Unicode characters are still present; after the second pass, they are decoded.
        google_image_thumbnail = bytes(google_image_thumbnail_not_fixed, 'ascii').decode('unicode-escape')
        print(google_image_thumbnail)

    # removing previously matched thumbnails for easier full resolution image matches.
    removed_matched_google_images_thumbnails = re.sub(
        r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]', '', str(matched_google_image_data))

    # https://regex101.com/r/fXjfb1/4
    # https://stackoverflow.com/a/19821774/15164646
    matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]",
                                                       removed_matched_google_images_thumbnails)

    print('\nDownloading Google Full Resolution Images:')  # in order
    for index, fixed_full_res_image in enumerate(matched_google_full_resolution_images):
        # https://stackoverflow.com/a/4004439/15164646 comment by Frédéric Hamidi
        original_size_img_not_fixed = bytes(fixed_full_res_image, 'ascii').decode('unicode-escape')
        original_size_img = bytes(original_size_img_not_fixed, 'ascii').decode('unicode-escape')
        print(original_size_img)

get_images_data()
-------------
'''
Google Images Metadata:
9,000+ Best Cat Photos · 100% Free Download · Pexels Stock Photos
pexels.com
https://www.pexels.com/search/cat/
...
Google Image Thumbnails:
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcR2cZsuRkkLWXOIsl9BZzbeaCcI0qav7nenDvvqi-YSm4nVJZYyljRsJZv6N5vS8hMNU_w&usqp=CAU
...
Full Resolution Images:
https://images.pexels.com/photos/1170986/pexels-photo-1170986.jpeg?cs=srgb&dl=pexels-evg-culture-1170986.jpg&fm=jpg
https://images.pexels.com/photos/3777622/pexels-photo-3777622.jpeg?auto=compress&cs=tinysrgb&dpr=1&w=500
...
'''
Alternatively, you can achieve the same thing by using Google Images API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't have to deal with regex to match and extract the needed data from the page source; instead, you only need to iterate over the structured JSON and get what you want, faster.
Code to integrate:
import os, json # json for pretty output
from serpapi import GoogleSearch
def get_google_images():
    params = {
        "api_key": os.getenv("API_KEY"),
        "engine": "google",
        "q": "pexels cat",
        "tbm": "isch"
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    print(json.dumps(results['images_results'], indent=2, ensure_ascii=False))

get_google_images()
---------------
'''
[
  ...
  {
    "position": 100,  # img number
    "thumbnail": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRR1FCGhFsr_qZoxPvQBDjVn17e_8bA5PB8mg&usqp=CAU",
    "source": "pexels.com",
    "title": "Close-up of Cat · Free Stock Photo",
    "link": "https://www.pexels.com/photo/close-up-of-cat-320014/",
    "original": "https://images.pexels.com/photos/2612982/pexels-photo-2612982.jpeg?auto=compress&cs=tinysrgb&dpr=1&w=500",
    "is_product": false
  }
]
'''
P.S - I wrote a more in-depth blog post about how to scrape Google Images, and how to reduce the chance of being blocked while web scraping search engines.
Disclaimer, I work for SerpApi.