I am trying to grab the video panel in a Google search result.
For example, I am searching for ---> "great+castles" <---
and that search result page has a panel that contains videos.
When I scrape it, I get HTML, but the attribute values are different,
so I am not able to grab the video panel.
import requests
from bs4 import BeautifulSoup

text = "great+castles"
url = f'https://google.com/search?q={text}'
response = requests.get(url)
print(url)

soup = BeautifulSoup(response.text, 'html.parser')
a = soup.findAll('div', {'id': 'main'})
a
I do get a response, but the attribute values are not the same as the ones I see in Google Chrome.
Firstly, you can always write that HTML response to an HTML file and check what you're actually getting by opening it in the browser.
Secondly, you cannot scrape data from Google that easily; you need proxies for that, and even with elite proxies you may face a number of challenges such as reCAPTCHA.
You have 2 options to check the source code returned by requests:
Save the response as an HTML file locally and open it in the browser (a sketch of this is shown below).
Install the Scrapy framework and use view(response) from the Scrapy Shell. The Scrapy option is handy but requires installing the framework, which can be overkill for a one-time project.
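A minimal sketch of the first option (the User-Agent value is just an example to make the request look less like the default python-requests client):

import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36"}
response = requests.get("https://google.com/search?q=great+castles", headers=headers)

# dump the raw HTML to disk and open it in a browser to see what requests actually received
with open("google_response.html", "w", encoding="utf-8") as f:
    f.write(response.text)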
There is also another (more robust) way to get results from Google Search: the Google Search API from SerpApi. It's a paid API with a free plan.
For example, your request will return handy JSON including the inline video section:
"inline_videos": [
  {
    "position": 1,
    "title": "A Thousand Years of European Castles",
    "link": "https://www.youtube.com/watch?v=uXSFt-zey84",
    "thumbnail": "https://i.ytimg.com/vi/uXSFt-zey84/mqdefault.jpg?sqp=-oaymwEECHwQRg&rs=AMzJL3n1trdIa7_n5X-kJf8pq70OYoY47w",
    "channel": "Best Documentary",
    "duration": "53:59",
    "platform": "YouTube",
    "date": "Jan 25, 2022"
  },
  {
    "position": 2,
    "title": "The Most Beautiful Castles in the World",
    "link": "https://www.youtube.com/watch?v=ln-v2ibnWHU",
    "thumbnail": "https://i.ytimg.com/vi/ln-v2ibnWHU/mqdefault.jpg?sqp=-oaymwEECHwQRg&rs=AMzJL3kHM2n3_vkRLM_stMr0XuiFs5uaCQ",
    "channel": "Luxury Homes",
    "duration": "4:58",
    "platform": "YouTube",
    "date": "Mar 29, 2020"
  },
  {
    "position": 3,
    "title": "Great Castles of Europe: Neuschwanstein (Part 1 of 3)",
    "link": "https://www.youtube.com/watch?v=R_uFzANW2Xo",
    "thumbnail": "https://i.ytimg.com/vi/R_uFzANW2Xo/mqdefault.jpg?sqp=-oaymwEECHwQRg&rs=AMzJL3nYdSY5YW2QU1pijXo3xx7ObrILdg",
    "channel": "trakehnen",
    "duration": "8:51",
    "platform": "YouTube",
    "date": "Sep 24, 2009"
  }
],
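A minimal sketch of such a request with SerpApi's Python client (the google-search-results package; the API key is a placeholder):

from serpapi import GoogleSearch

params = {
    "engine": "google",    # search engine
    "q": "great castles",  # query
    "api_key": "..."       # https://serpapi.com/manage-api-key
}

search = GoogleSearch(params)
results = search.get_dict()

# "inline_videos" holds the video panel shown on the regular results page
for video in results.get("inline_videos", []):
    print(video["title"], video["link"])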
Disclaimer, I work for SerpApi.
You can scrape Google Search video panel results using the BeautifulSoup web scraping library.
To get to the tab we need, you have to register it in the request parameters, like this:
# these URL params are taken from the actual Google search URL
# and transformed into a more readable format
params = {
    "q": "great castles",  # query
    "tbm": "vid",          # video results tab
    "gl": "us",            # country of the search
    "hl": "en"             # language of the search
}
To get the required data, you need to find a "container": a class selector that holds all the information about a video result, i.e. the title, link, channel name and so on.
In our case, this is the "video-voyager" selector, which contains the title, channel name, video link, description and more.
Have a look at the SelectorGadget Chrome extension to easily pick selectors by clicking on the desired element in your browser (it doesn't always work perfectly if the website is rendered via JavaScript).
Check code in online IDE.
from bs4 import BeautifulSoup
import requests, lxml, json

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36"
}

params = {
    "q": "great castles",  # query
    "tbm": "vid",          # video results tab
    "gl": "us",            # country of the search
    "hl": "en"             # language of the search
}

# by default it scrapes video page results, but that can be turned off
def scrape_google_videos(inline_videos=False, video_page=True):
    if inline_videos:
        data_inline_video = []
        params.pop("tbm", None)  # deletes "tbm": "vid"

        html = requests.get("https://www.google.com/search", headers=headers, params=params, timeout=30)
        soup = BeautifulSoup(html.text, "lxml")

        print("Inline video data:\n")
        for result in soup.select(".WZIVy"):
            title = result.select_one(".cHaqb").text
            platform = result.select_one("cite").text
            channel = result.select_one(".pcJO7e span").text.replace(" · ", "")
            date = result.select_one(".hMJ0yc span").text

            data_inline_video.append({
                "title": title,
                "platform": platform,
                "channel": channel,
                "date": date
            })

        print(json.dumps(data_inline_video, indent=2, ensure_ascii=False))

    if video_page:
        data_video_panel = []

        html = requests.get("https://www.google.com/search", headers=headers, params=params, timeout=30)
        soup = BeautifulSoup(html.text, "lxml")

        print("Video panel data:\n")
        for products in soup.select("video-voyager"):
            title = products.select_one(".DKV0Md").text
            description = products.select_one(".Uroaid").text
            link = products.select_one(".ct3b9e a")["href"]
            channel = products.select_one(".Zg1NU+ span").text
            duration = products.select_one(".J1mWY div").text
            date = products.select_one(".P7xzyf span span").text

            data_video_panel.append({
                "title": title,
                "description": description,
                "link": link,
                "channel": channel,
                "duration": duration,
                "date": date
            })

        print(json.dumps(data_video_panel, indent=2, ensure_ascii=False))

scrape_google_videos(inline_videos=True, video_page=False)  # matches the "Inline video data" output below
Inline video data:
[
{
"title": "A Thousand Years of European Castles",
"platform": "YouTube",
"chanel": "Best Documentary",
"date": "Jan 25, 2022"
},
{
"title": "MOST BEAUTIFUL Castles on Earth",
"platform": "YouTube",
"chanel": "Top Fives",
"date": "Feb 2, 2022"
},
{
"title": "Great Castles of Europe: Neuschwanstein (Part 1 of 3)",
"platform": "YouTube",
"chanel": "trakehnen",
"date": "Sep 24, 2009"
}
]
I will use this code to explain my question.
Using the URL without the sold filter:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
url = "https://www.ebay.es/sch/i.html?_from=R40&_trksid=p2334524.m570.l1313&_nkw=iphone+x&_sacat=0&LH_TitleDesc=0&_udlo=400&LH_Auction=1&_osacat=0&_odkw=Pok%C3%A9mon+card+Charizard+4%2F102&rt=nc"
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
results = soup.find_all("div", {"class": "s-item__info clearfix"})
print(len(results))
Output: 12
Then I use the URL where there are only sold items; I checked the HTML and the class is the same.
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
url = "https://www.ebay.es/sch/i.html?_from=R40&_nkw=iphone+x&_sacat=0&LH_TitleDesc=0&_udlo=400&LH_Auction=1&rt=nc&LH_Sold=1&LH_Complete=1"
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
results = soup.find_all("div", {"class": "s-item__info clearfix"})
print(len(results))
Output: 0
I tried different classes, but I can never get anything back.
Thanks.
Update: it was a CAPTCHA problem. Thanks!
There are several reasons why the output may be empty.
This is often because the site thinks it is being accessed by a bot. The default user-agent in the requests library is python-requests, and this can be prevented by passing your actual User-Agent in the "headers". It seems to be the reason why you get a CAPTCHA.
If passing a User-Agent alone doesn't work, the next step is to rotate user-agents, for example switching between PC, mobile, and tablet, as well as between browsers, e.g. Chrome, Firefox, Safari, Edge and so on.
If passing request headers is still not enough, you can try using proxies (ideally residential) in combination with request headers.
An additional step is to use a CAPTCHA solver, for example 2captcha. It allows bypassing all possible CAPTCHAs depending on the target website.
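A minimal sketch of rotating the User-Agent and optionally adding a proxy with requests (the UA strings and the proxy address are placeholders):

import random
import requests

# a small pool of User-Agent strings to rotate between (placeholders, add your own)
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.0 Safari/605.1.15",
    "Mozilla/5.0 (iPhone; CPU iPhone OS 16_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148",
]

headers = {"User-Agent": random.choice(user_agents)}

# optional residential proxy; replace with a real endpoint if you use one
proxies = {"https": "http://user:pass@proxy.example.com:8080"}

response = requests.get("https://www.ebay.es/sch/i.html",
                        params={"_nkw": "iphone x", "LH_Sold": "1"},
                        headers=headers,
                        # proxies=proxies,  # uncomment when a proxy is available
                        timeout=30)
print(response.status_code)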
Check the code using BeautifulSoup in online IDE.
from bs4 import BeautifulSoup
import requests, json, lxml

# https://requests.readthedocs.io/en/latest/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36",
}

params = {
    '_nkw': 'iphone_x',  # search query
    'LH_Sold': '1',      # shows sold items
    '_pgn': 1            # page number
}

data = []
limit = 10  # page limit (if needed)

while True:
    page = requests.get('https://www.ebay.es/sch/i.html', params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(page.text, 'lxml')

    print(f"Extracting page: {params['_pgn']}")
    print("-" * 10)

    for products in soup.select(".s-item__info"):
        title = products.select_one(".s-item__title span").text
        price = products.select_one(".s-item__price").text

        data.append({
            "title": title,
            "price": price
        })

    if params['_pgn'] == limit:
        break

    if soup.select_one(".pagination__next"):
        params['_pgn'] += 1
    else:
        break

print(json.dumps(data, indent=2, ensure_ascii=False))
Example output:
[
{
"title": "Apple iPhone X 64 GB y 256 GB Grado A++ Desbloqueado - Excelente Estado Todos los Colores",
"price": "234,52 EUR"
},
{
"title": "Funda de silicona a prueba de golpes para iPhone 11 Pro Max 14Pro 8 7 SE 2022 colores",
"price": "4,56 EUR"
},
{
"title": "Apple iPhone X 64 GB 256 GB gris plateado sin contrato COMO NUEVO SIN MANCHA Wow ",
"price": "377,00 EUR a 409,00 EUR"
},
{
"title": "Funda transparente de silicona completa a prueba de golpes para iPhone 11 12 13 14 PRO MAX Mini X XR 8",
"price": "1,13 EUR a 4,06 EUR"
},
{
"title": "Apple iPhone X - 256 GB - Plateado (Desbloqueado) (Leer descripción) FA1065",
"price": "163,88 EUR"
},
other results ...
]
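If you want the results as a table or CSV (the question already imports pandas), a small follow-up sketch using the data list collected above:

import pandas as pd

# "data" is the list of dicts collected in the loop above
df = pd.DataFrame(data)
df.to_csv("ebay_sold_items.csv", index=False)
print(df.head())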
You can also use the official eBay Finding API, which has a limit of 5,000 requests per day, or a third-party API like the Ebay Organic Results API from SerpApi. It's a paid API with a free plan that handles blocks and parsing on their backend.
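For the Finding API route, a rough sketch using the ebaysdk package (an assumption on my part; the app ID is a placeholder and the call/field names follow eBay's Finding API docs):

from ebaysdk.finding import Connection

api = Connection(appid="YOUR_APP_ID", config_file=None)

# findCompletedItems returns ended listings; the SoldItemsOnly filter keeps only actual sales
response = api.execute("findCompletedItems", {
    "keywords": "iphone x",
    "itemFilter": [{"name": "SoldItemsOnly", "value": "true"}],
    "paginationInput": {"entriesPerPage": 100, "pageNumber": 1}
})

for item in response.reply.searchResult.item:
    print(item.title, item.sellingStatus.currentPrice.value)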
Example SerpApi code with pagination:
from serpapi import EbaySearch
import json

params = {
    "api_key": "...",          # serpapi key, https://serpapi.com/manage-api-key
    "engine": "ebay",          # search engine
    "ebay_domain": "ebay.es",  # ebay domain
    "_nkw": "iphone_x",        # search query
    "LH_Sold": "1",            # shows sold items
    "_pgn": 1                  # page number
}

search = EbaySearch(params)  # where data extraction happens

page_num = 0
data = []

while True:
    results = search.get_dict()  # JSON -> Python dict

    if "error" in results:
        print(results["error"])
        break

    for organic_result in results.get("organic_results", []):
        title = organic_result.get("title")
        price = organic_result.get("price")

        data.append({
            "title": title,
            "price": price
        })

    page_num += 1
    print(page_num)

    if "next" in results.get("pagination", {}):
        params['_pgn'] += 1
    else:
        break

print(json.dumps(data, indent=2))
Output:
[
{
"title": "Apple iPhone X (10) Desbloqueado 64 GB/256 GB Gris espacial/Plateado - Excelente Estado",
"price": {
"raw": "297,34 EUR",
"extracted": 297.34
}
},
{
"title": "Nuevo anuncioApple iPhone X - 64GB - Bianco (Sbloccato).",
"price": {
"raw": "340,00 EUR",
"extracted": 340.0
}
},
{
"title": "Apple iPhone X - 256GB - Factory Unlocked - Good Condition",
"price": {
"raw": "230,80 EUR",
"extracted": 230.8
}
},
other results ...
]
I am trying to extract Google search results using the Google API in Python. I am able to extract the URL, link, title and snippet, but I also want to extract the rating that is displayed in the Google search results.
Below is the code I am using:
# Google Search function (Custom Search JSON API)
from googleapiclient.discovery import build

def google_search(search_term, api_key, cse_id, **kwargs):
    service = build("customsearch", "v1", developerKey=api_key)
    res = service.cse().list(q=search_term, cx=cse_id, start=1, hq='company reviews', **kwargs).execute()
    return res['items']
results = google_search('Swiggy', my_api_key, my_cse_id, num=10)
print(results[2]["title"])
print(results[2]["link"])
print(results[2]["displayLink"])
print(results[2]["snippet"])
I can see that the first search result, when searching "swiggy company review" on Google, shows a rating of 3.7, but I don't know how to extract that information. Can anyone please suggest a solution?
Thanks in advance.
Since the Google API has been deprecated, this can easily be done by scraping with BeautifulSoup's CSS selector methods select() (for multiple elements) / select_one() (for a specific element), among other techniques.
Code and full example:
from bs4 import BeautifulSoup
import requests, lxml, json

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

response = requests.get('https://www.google.com/search?q=swiggy company review', headers=headers).text

soup = BeautifulSoup(response, 'lxml')

# Selects just one Review element (using a converted XPath -> CSS selector):
# review = soup.select_one('#rso > div:nth-of-type(1) > div > div > div:nth-of-type(2) > div > span:nth-of-type(1)').text
# print(review)

# Selects just one Vote element (using a converted XPath -> CSS selector):
# votes = soup.select_one('#rso > div:nth-of-type(1) > div > div > div:nth-of-type(2) > div > span:nth-of-type(2)').text
# print(votes)

data = []

# Selects multiple Vote elements:
for something in soup.select('.uo4vr'):
    rating = something.select_one('.uo4vr g-review-stars+ span').text.split(':')[1].strip()
    votes_reviews = something.select_one('.uo4vr span+ span').text.split(' ')[0]

    data.append({
        "Rating": rating,
        "Votes/Reviews": votes_reviews,
    })

print(json.dumps(data, indent=2))
Output:
[
{
"Rating": "4",
"Votes/Reviews": "1,219"
},
{
"Rating": "4",
"Votes/Reviews": "1,090"
},
{
"Rating": "3.8",
"Votes/Reviews": "46"
},
{
"Rating": "3.8",
"Votes/Reviews": "260"
},
{
"Rating": "4.1",
"Votes/Reviews": "1,047"
},
{
"Rating": "3.3",
"Votes/Reviews": "47"
},
{
"Rating": "1.5",
"Votes/Reviews": "114"
}
]
Alternatively, you can use Google Organic Results API from SerpApi. It's a paid API with a free trial.
Code to integrate:
from serpapi import GoogleSearch
import os, json

params = {
    "engine": "google",
    "q": "swiggy company review",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

# For extracting single elements:
# rating = results['organic_results'][0]['rich_snippet']['top']['detected_extensions']['rating']
# print(f"Rating: {rating}")

# votes = results['organic_results'][0]['rich_snippet']['top']['detected_extensions']['votes']
# print(f"Votes: {votes}")

# For extracting multiple elements:
data = []

for organic_result in results['organic_results']:
    title = organic_result['title']

    try:
        rating = organic_result['rich_snippet']['top']['detected_extensions']['rating']
    except:
        rating = None

    try:
        votes = organic_result['rich_snippet']['top']['detected_extensions']['votes']
    except:
        votes = None

    try:
        reviews = organic_result['rich_snippet']['top']['detected_extensions']['reviews']
    except:
        reviews = None

    data.append({
        "Title": title,
        "Rating": rating,
        "Votes": votes,
        "Reviews": reviews,
    })

print(json.dumps(data, indent=2))
Output:
[
{
"Title": "Swiggy Reviews | Glassdoor",
"Rating": 4,
"Votes": 1219,
"Reviews": null
},
{
"Title": "Ride.Swiggy: 254 Employee Reviews | Indeed.com",
"Rating": null,
"Votes": null,
"Reviews": null
},
{
"Title": "Working at Swiggy | Glassdoor",
"Rating": 4,
"Votes": 1090,
"Reviews": null
}
]
Disclaimer, I work for SerpApi.
I have a string like this:
url = 'http://scholar.google.pl/citations?view_op\x3dsearch_authors\x26hl\x3dpl\x26oe\x3dLatin2\x26mauthors\x3dlabel:security\x26after_author\x3drukAAOJ8__8J\x26astart\x3d10'
I wish to convert it to this:
converted_url = 'https://scholar.google.pl/citations?view_op=search_authors&hl=en&mauthors=label:security&after_author=rukAAOJ8__8J&astart=10'
I have tried this:
converted_url = url.decode('utf-8')
However, this error is thrown:
AttributeError: 'str' object has no attribute 'decode'
You can use requests to do the URL building and decoding for you by passing the query parameters as a dict (see the code below).
Note: the after_author URL parameter is a next page token, so when you make a request to the exact URL you provided, the returned HTML will not be what you expect, because after_author changes on every request. For example, in my case it is different (uB8AAEFN__8J), while in your URL it's rukAAOJ8__8J.
To get it to work, you need to parse the next page token from the first page, which will lead to the second page, and so on. For example:
# from my other answer:
# https://github.com/dimitryzub/stackoverflow-answers-archive/blob/main/answers/scrape_all_scholar_profiles_bs4.py
params = {
    "view_op": "search_authors",
    "mauthors": "valve",
    "hl": "pl",
    "astart": 0
}

authors_is_present = True

while authors_is_present:
    # if the next page is present -> update the next page token and increment to the next page
    # if the next page is not present -> exit the while loop
    if soup.select_one("button.gs_btnPR")["onclick"]:
        params["after_author"] = re.search(r"after_author\\x3d(.*)\\x26", str(soup.select_one("button.gs_btnPR")["onclick"])).group(1)  # -> XB0HAMS9__8J
        params["astart"] += 10
    else:
        authors_is_present = False
Code and example to extract profiles data in the online IDE:
from parsel import Selector
import requests, json

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "label:security",
    "hl": "pl",
    "view_op": "search_authors"
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36",
}

html = requests.get("https://scholar.google.pl/citations", params=params, headers=headers, timeout=30)
selector = Selector(html.text)

profiles = []

for profile in selector.css(".gs_ai_chpr"):
    profile_name = profile.css(".gs_ai_name a::text").get()
    profile_link = f'https://scholar.google.com{profile.css(".gs_ai_name a::attr(href)").get()}'
    profile_email = profile.css(".gs_ai_eml::text").get()
    profile_interests = profile.css(".gs_ai_one_int::text").getall()

    profiles.append({
        "profile_name": profile_name,
        "profile_link": profile_link,
        "profile_email": profile_email,
        "profile_interests": profile_interests
    })

print(json.dumps(profiles, indent=2))
Alternatively, you can achieve the same thing using Google Scholar Profiles API from SerpApi. It's a paid API with a free plan.
The difference is that you don't need to figure out how to extract data, bypass blocks from search engines, increase the number of requests, and so on.
Example code to integrate:
from serpapi import GoogleSearch
import os, json

params = {
    "api_key": os.getenv("API_KEY"),       # SerpApi API key
    "engine": "google_scholar_profiles",   # SerpApi profiles parsing engine
    "hl": "pl",                            # language
    "mauthors": "label:security"           # search query
}

search = GoogleSearch(params)
results = search.get_dict()

for profile in results["profiles"]:
    print(json.dumps(profile, indent=2))
# part of the output:
'''
{
"name": "Johnson Thomas",
"link": "https://scholar.google.com/citations?hl=pl&user=eKLr0EgAAAAJ",
"serpapi_link": "https://serpapi.com/search.json?author_id=eKLr0EgAAAAJ&engine=google_scholar_author&hl=pl",
"author_id": "eKLr0EgAAAAJ",
"affiliations": "Professor of Computer Science, Oklahoma State University",
"email": "Zweryfikowany adres z cs.okstate.edu",
"cited_by": 159999,
"interests": [
{
"title": "Security",
"serpapi_link": "https://serpapi.com/search.json?engine=google_scholar_profiles&hl=pl&mauthors=label%3Asecurity",
"link": "https://scholar.google.com/citations?hl=pl&view_op=search_authors&mauthors=label:security"
},
{
"title": "cloud computing",
"serpapi_link": "https://serpapi.com/search.json?engine=google_scholar_profiles&hl=pl&mauthors=label%3Acloud_computing",
"link": "https://scholar.google.com/citations?hl=pl&view_op=search_authors&mauthors=label:cloud_computing"
},
{
"title": "big data",
"serpapi_link": "https://serpapi.com/search.json?engine=google_scholar_profiles&hl=pl&mauthors=label%3Abig_data",
"link": "https://scholar.google.com/citations?hl=pl&view_op=search_authors&mauthors=label:big_data"
}
],
"thumbnail": "https://scholar.google.com/citations/images/avatar_scholar_56.png"
}
'''
Disclaimer, I work for SerpApi.
decode is used to convert bytes into a string, and your url is a string, not bytes.
You can use encode to convert this string into bytes and later use decode to convert it back to the correct string.
(I use the prefix r to simulate text with this problem; without the prefix the url wouldn't need to be converted.)
url = r'http://scholar.google.pl/citations?view_op\x3dsearch_authors\x26hl\x3dpl\x26oe\x3dLatin2\x26mauthors\x3dlabel:security\x26after_author\x3drukAAOJ8__8J\x26astart\x3d10'
print(url)
url = url.encode('utf-8').decode('unicode_escape')
print(url)
result:
http://scholar.google.pl/citations?view_op\x3dsearch_authors\x26hl\x3dpl\x26oe\x3dLatin2\x26mauthors\x3dlabel:security\x26after_author\x3drukAAOJ8__8J\x26astart\x3d10
http://scholar.google.pl/citations?view_op=search_authors&hl=pl&oe=Latin2&mauthors=label:security&after_author=rukAAOJ8__8J&astart=10
BTW: first check print(url); maybe you already have the correct url but are using the wrong method to display it. The Python shell displays results without print() using repr(), which shows some characters as escape codes to indicate what encoding the text uses (utf-8, iso-8859-1, win-1250, latin-1, etc.).
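A tiny illustration of that difference (assuming the backslashes really are literal characters in the string):

url = r'citations?view_op\x3dsearch_authors'  # raw string: the backslashes are literal characters

url          # an interactive shell shows repr(): 'citations?view_op\\x3dsearch_authors'
print(url)   # print() shows the characters themselves: citations?view_op\x3dsearch_authors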
I just want to grab public information from my accounts on two social media sites (Instagram and Twitter). My code returns info for Twitter, and I know the XPath is correct for Instagram, but for some reason I'm not getting data for it. I know the XPaths could be more specific, but I can fix that later. Both my accounts are public.
1) I thought maybe it didn't like the Python header, so I tried changing it and I still get nothing. That line is commented out but it's still there.
2) I heard something about an API on GitHub, but that lengthy code is very intimidating and way above my level of understanding. I don't know more than half of what I'm reading there.
from lxml import html
import requests
import webbrowser
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
#page = requests.get('https://www.instagram.com/<my account>/', headers=headers)
page = requests.get('https://www.instagram.com/<my account>/')
tree = html.fromstring(page.text)
pageTwo = requests.get('http://www.twitter.com/<my account>')
treeTwo = html.fromstring(pageTwo.text)
instaFollowers = tree.xpath("//span[@data-reactid='.0.1.0.0:0.1.3.1.0']/span[2]/text()")
instaFollowing = tree.xpath("//span[@data-reactid='.0.1.0.0:0.1.3.2.0']/span[2]/text()")
twitFollowers = treeTwo.xpath("//a[@data-nav='followers']/span[@class='ProfileNav-value']/text()")
twitFollowing = treeTwo.xpath("//a[@data-nav='following']/span[@class='ProfileNav-value']/text()")
print ''
print '--------------------'
print 'Social Media Checker'
print '--------------------'
print ''
print 'Instagram: ' + str(instaFollowers) + ' / ' + str(instaFollowing)
print ''
print 'Twitter: ' + str(twitFollowers) + ' / ' + str(twitFollowing)
As mentioned, Instagram's page source does not reflect its rendered source, as a JavaScript function passes content from JSON data to the browser. Hence, what Python scrapes in the page source does not show exactly what the browser renders to the screen. Welcome to the new world of dynamic web programming! Consider using Instagram's API or another parser that can retrieve JavaScript-generated content (not just the page source).
With that said, if you simply need the IG account data, you can still use Python's lxml to XPath the JSON content in the <script> tag (specifically the sixth occurrence here, but adjust to the page you need). The example below parses Google's Instagram JSON data:
import lxml.etree as et
import urllib.request as rq

rqpage = rq.urlopen('https://instagram.com/google')
txtpage = rqpage.read()

tree = et.HTML(txtpage)
jsondata = tree.xpath("//script[@type='text/javascript' and position()=6]/text()")

for i in jsondata:
    print(i)
OUTPUT
window._sharedData = {"qs":"{\"shift\":10,\"header
\":\"n3bTdmHGHDgxvZYPN0KDFHqbkxd6zpTl\",\"edges\":100,\"blob
\":\"AQCq42rOTCnKOZcOxFn06L1J6_W8wY6ntAS1bX88VBClAjQD9PyJdefCzOwfSAbUdsBwHKb1QSndurPtjyN-
rHMOrZ_6ubE_Xpu908cyron9Zczkj4QMkAYUHIgnmmftuXG8rrFzq_Oq3BoXpQgovI9hefha-
6SAs1RLJMwMArrbMlFMLAwyd1TZhArcxQkk9bgRGT4MZK4Tk2VNt1YOKDN1pO3NJneFlUxdUJTdDX
zj3eY-stT7DnxF_GM_j6xwk1o\",\"iterations\":7,\"size\":42}","static_root":"
\/\/instagramstatic-a.akamaihd.net\/bluebar\/5829dff","entry_data":
{"ProfilePage":[{"__query_string":"?","__path":"\/google\/","__get_params":
{},"user":{"username":"google","has_blocked_viewer":false,"follows":
{"count":10},"requested_by_viewer":false,"followed_by":
{"count":977186},"country_block":null,"has_requested_viewer":false,"followed_
by_viewer":false,"follows_viewer":false,"profile_pic_url":"https:
\/\/instagram.ford1-1.fna.fbcdn.net\/hphotos-xfp1\/t51.2885-19\/s150x150
\/11910217_933356470069152_115044571_a.jpg","is_private":false,"full_name":
"Google","media":{"count":180,"page_info":
{"has_previous_page":false,"start_cursor":"1126896719808871555","end_cursor":
"1092117490206686720","has_next_page":true},"nodes":[{"code":"-
jipiawryD","dimensions":{"width":640,"height":640},"owner":
{"id":"1067259270"},"comments":{"count":105},"caption":"Today's the day!
Your searches are served. Happy Thanksgiving \ud83c\udf57\ud83c\udf70
#GoogleTrends","likes":
{"count":11410},"date":1448556579.0,"thumbnail_src":"https:\/
\/instagram.ford1-1.fna.fbcdn.net\/hphotos-xat1\/t51.2885-15\/e15\
/11848856_482502108621097_589421586_n.jpg","is_video":true,"id":"112689671980
8871555","display_src":"https:\/\/instagram.ford1-1.fna.fbcdn.net\/hphotos-
xat1\/t51.2885-15
...
JSON Pretty Print (extracting the window._sharedData variable from above)
See below where the user data (followers, following, etc.) appears at the beginning:
{
"qs": "{\"shift\":10,\"header\":\"n3bTdmHGHDgxvZYPN0KDFHqbkxd6zpTl\",\"edges\":100,\"blob\":\"AQCq42rOTCnKOZcOxFn06L1J6_W8wY6ntAS1bX88VBClAjQD9PyJdefCzOwfSAbUdsBwHKb1QSndurPtjyN-rHMOrZ_6ubE_Xpu908cyron9Zczkj4QMkAYUHIgnmmftuXG8rrFzq_Oq3BoXpQgovI9hefha-6SAs1RLJMwMArrbMlFMLAwyd1TZhArcxQkk9bgRGT4MZK4Tk2VNt1YOKDN1pO3NJneFlUxdUJTdDXzj3eY-stT7DnxF_GM_j6xwk1o\",\"iterations\":7,\"size\":42}",
"static_root": "\/\/instagramstatic-a.akamaihd.net\/bluebar\/5829dff",
"entry_data": {
"ProfilePage": [
{
"__query_string": "?",
"__path": "\/google\/",
"__get_params": {
},
"user": {
"username": "google",
"has_blocked_viewer": false,
"follows": {
"count": 10
},
"requested_by_viewer": false,
"followed_by": {
"count": 977186
},
"country_block": null,
"has_requested_viewer": false,
"followed_by_viewer": false,
"follows_viewer": false,
"profile_pic_url": "https:\/\/instagram.ford1-1.fna.fbcdn.net\/hphotos-xfp1\/t51.2885-19\/s150x150\/11910217_933356470069152_115044571_a.jpg",
"is_private": false,
"full_name": "Google",
"media": {
"count": 180,
"page_info": {
"has_previous_page": false,
"start_cursor": "1126896719808871555",
"end_cursor": "1092117490206686720",
"has_next_page": true
},
"nodes": [
{
"code": "-jipiawryD",
"dimensions": {
"width": 640,
"height": 640
},
"owner": {
"id": "1067259270"
},
"comments": {
"count": 105
},
"caption": "Today's the day! Your searches are served. Happy Thanksgiving \ud83c\udf57\ud83c\udf70 #GoogleTrends",
"likes": {
"count": 11410
},
"date": 1448556579,
"thumbnail_src": "https:\/\/instagram.ford1-1.fna.fbcdn.net\/hphotos-xat1\/t51.2885-15\/e15\/11848856_482502108621097_589421586_n.jpg",
"is_video": true,
"id": "1126896719808871555",
"display_src": "https:\/\/instagram.ford1-1.fna.fbcdn.net\/hphotos-xat1\/t51.2885-15\/e15\/11848856_482502108621097_589421586_n.jpg"
},
{
"code": "-hwbf2wr0O",
"dimensions": {
"width": 640,
"height": 640
},
"owner": {
"id": "1067259270"
},
"comments": {
"count": 95
},
"caption": "Thanksgiving dinner is waiting. But first, the airport. \u2708\ufe0f #GoogleApp",
"likes": {
"count": 12621
},
...
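To actually pull the follower counts out of that JSON, a minimal sketch (it assumes the window._sharedData structure shown above; Instagram has changed this markup since, so treat it as illustrative):

import json
import lxml.etree as et
import urllib.request as rq

html = rq.urlopen('https://instagram.com/google').read()
tree = et.HTML(html)

# find the <script> whose text starts with "window._sharedData"
script = next(s for s in tree.xpath("//script/text()")
              if s.strip().startswith("window._sharedData"))

# strip the "window._sharedData = " prefix and the trailing ";" before parsing
shared_data = json.loads(script.split(" = ", 1)[1].rstrip().rstrip(";"))

user = shared_data["entry_data"]["ProfilePage"][0]["user"]
print("followers:", user["followed_by"]["count"])
print("following:", user["follows"]["count"])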
If anyone is still interested in this sort of thing, using Selenium solved my problems.
http://pastebin.com/5eHeDt3r
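A rough sketch of that Selenium route (the selector is an assumption on my part; Instagram's markup changes often, so the pastebin above has the original code):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.instagram.com/<my account>/")

# follower/following counts are rendered by JavaScript, so they only exist in the live DOM
stats = driver.find_elements(By.CSS_SELECTOR, "header section ul li")
for stat in stats:
    print(stat.text)

driver.quit()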
Is there a faster way?
In case you want to find information about yourself and others without hassling with code, try this piece of software. Apart from automatic scraping, it analyzes and visualizes the collected information in a PDF report, covering these social networks: Facebook, Twitter, Instagram, and the Google Search engine.
P.S. I am the main developer and maintainer of this project.
I'm scraping a Google search, but I can only get the first row of a two-row chart on the right-hand side.
The search query is:
https://www.google.com/search?q=kegerators
I've noticed that inspecting the element doesn't really help, as BeautifulSoup seems to extract different code.
The code I have is:
htmltext = br.open(query).read()  # br: a browser object, e.g. mechanize.Browser()
soup = BeautifulSoup(htmltext)
search = soup.findAll("div", attrs={"class": "_cf"})
print search
Upon looking at the output (basically searching for "b>$", as I know I should see 8 of those), I only get 4, which happen to be the top row of the chart.
This is the result of the search:
[<div class="_cf" style="overflow:hidden"><span class="_vf" style="height:86px;width:86px"><span class="_uf"></span><img class="_wf" src="http://t3.gstatic.com/shopping?q=tbn:ANd9GcRY5NBoY-anFlJUYExmil81vJG5i1nw6LqVu64lSjw8tSPBUEdh3JaiFix-gfSKMGtE2ZwX8w&usqp=CAc"/></span><div style="height:2.4em;overflow:hidden">EdgeStar Ultra Low Temp F...</div><div><b>$599.00</b></div><div><cite style="white-space:nowrap">Kegerator</cite></div></div>, <div class="_cf" style="overflow:hidden"><span class="_vf" style="height:86px;width:86px"><span class="_uf"></span><img class="_wf" src="http://t3.gstatic.com/shopping?q=tbn:ANd9GcRS4iCsD4EDV37Rg1kZf0nxFK3bYgYaWC-bxMv-ISg4dI8m-COU3ZHCZGs3FdJBK3npkpoE&usqp=CAc"/></span><div style="height:2.4em;overflow:hidden">Kegco K199SS‑2 D...</div><div><b>$539.99</b></div><div><cite style="white-space:nowrap">BeverageFa...</cite></div></div>, <div class="_cf" style="overflow:hidden"><span class="_vf" style="height:86px;width:86px"><span class="_uf"></span><img class="_wf" src="http://t2.gstatic.com/shopping?q=tbn:ANd9GcSkf6-jVZt34pd_6QyqZGre06VxszvFZX70-wUOEDRhEFhorX_Yek0oyr-5jvk8FNpj2KWusQ&usqp=CAc"/></span><div style="height:2.4em;overflow:hidden">EdgeStar Ultra Low Temp F...</div><div><b>$499.00</b></div><div><cite style="white-space:nowrap">Compact Ap...</cite></div></div>, <div class="_cf" style="overflow:hidden"><span class="_vf" style="height:86px;width:86px"><span class="_uf"></span><img class="_wf" src="http://t1.gstatic.com/shopping?q=tbn:ANd9GcTf56EQ6DVbOk02D7cLgVmlurU-2gNrhD6a74MnzQBWg1W290DTYQuj0sSUxQEbxo1XO6pB&usqp=CAc"/></span><div style="height:2.4em;overflow:hidden">FunTime Black Kegge...</div><div><b>$399.99</b></div><div><cite style="white-space:nowrap">Max Tool</cite></div></div>]
Is Google doing something strange here?
The reason the results might differ is that Google displays different results on each request, e.g. sometimes it returns 10 shopping results, sometimes 7 or 4.
Specifying the gl (country, e.g. us) and hl (language, e.g. en) query params could get the exact result, or close to the one you see in your browser.
Also, don't forget to specify a user-agent, otherwise Google will eventually block your requests.
Code and example in the online IDE:
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": "buy coffe",  # intentional grammatical error to display right side shopping results
    "hl": "en",
    "gl": "us"
}

response = requests.get("https://www.google.com/search", headers=headers, params=params)
soup = BeautifulSoup(response.text, 'html.parser')

# scrapes both top and right side shopping results
for result in soup.select('.pla-hovercard-content-ellip'):
    title = result.select_one('.pymv4e').text
    link = result.select_one('.pla-hovercard-content-ellip a.tkXAec')['href']
    ad_link = f"https://www.googleadservices.com/pagead{result.select_one('.pla-hovercard-content-ellip a')['href']}"
    price = result.select_one('.qptdjc').text

    try:
        rating = result.select_one('.Fam1ne.tPhRLe')["aria-label"].replace("Rated ", "").replace(" out of ", "").replace(",", "")
    except:
        rating = None

    try:
        reviews = result.select_one('.GhQXkc').text.replace("(", "").replace(")", "")
    except:
        reviews = None

    source = result.select_one('.zPEcBd.LnPkof').text.strip()

    print(f'{title}\n{link}\n{ad_link}\n{price}\n{rating}\n{reviews}\n{source}\n')
-------------
'''
MUD\WTR | Mushroom Coffee Replacement, 90 servings
https://mudwtr.com/collections/shop/products/90-serving-bag
https://www.googleadservices.com/pagead/aclk?sa=l&ai=DChcSEwj5p8u-2rzyAhV2yJQJHfzhBoUYABAHGgJ5bQ&sig=AOD64_3NGBzLzkTv61K7kSrD2f9AREHH_g&ctype=5&q=&ved=2ahUKEwji7MK-2rzyAhWaaM0KHcnaDDcQ9aACegQIAhBo&adurl=
$125.00
4.85
1k+
mudwtr.com
...
'''
Alternatively, you can do the same thing using Google Inline Shopping API from SerpApi. It's a paid API with a free plan.
The difference is that everything is already extracted, and all that needs to be done is just to iterate over structured JSON.
Code to integrate:
import json, os
from serpapi import GoogleSearch

params = {
    "api_key": os.getenv("API_KEY"),
    "engine": "google",
    "q": "buy coffe",
    "hl": "en",
    "gl": "us",
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['shopping_results']:
    print(json.dumps(result, indent=2, ensure_ascii=False))
--------
'''
{
"position": 1,
"block_position": "right",
"title": "Maxwell House Original Roast | 48oz",
"price": "$10.49",
"extracted_price": 10.49,
"link": "https://www.google.com/aclk?sa=l&ai=DChcSEwiGn8aT2rzyAhXgyZQJHZHdBJMYABAEGgJ5bQ&ae=2&sig=AOD64_0jBjdUIMeqJvrXYxn4NGcpwCYrJQ&ctype=5&q=&ved=2ahUKEwiOxLmT2rzyAhWiFVkFHWMNAaEQ5bgDegQIAhBa&adurl=",
"source": "Boxed",
"rating": 4.6,
"reviews": 2000,
"thumbnail": "https://serpapi.com/searches/611e1b2cfdca3e6a1c9335e6/images/e4ae7f31164ec52021f1c04d8be4e4bda2138b1acd12c868052125eb86ead292.png"
}
...
'''
P.S - I wrote a blog post about this topic that you can find here.
Disclaimer, I work for SerpApi.