Scraping tabbed table from AWS pricing - python

I am trying to builder scraper to scrape tabs which are tables in this page (https://aws.amazon.com/sagemaker/pricing/) I am only interested in the data thats training, processing and few others.
req = requests.get(url)
soup = bs4.BeautifulSoup(req.content)
tables = soup.find_all("table")
inst_table = str(tables[0])
But it looks like I have to use some sort of a dynamic mechanism to get the tabbed switch.
Assume we clicked on training tab, My goal is to build a file that stores scraped data
"ml.t2.medium": {
"vCPU": 2.0,
"mem_GiB": 4.0,
"price": 0.15,
"category": "Standard",
"task": "training",
}

The good news is you don't need the heavy guns of selenium.
As with AWS, there's almost alwyas an API you can query that returns the data you want.
Here's what you need and how to get it:
import json
import time
import requests
headers = {
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:94.0) Gecko/20100101 Firefox/94.0",
}
endpoint = f"https://b0.p.awsstatic.com/pricing/2.0/meteredUnitMaps/" \
f"sagemaker/USD/current/sagemaker-instances.json?" \
f"timestamp={int(time.time())}"
response = requests.get(endpoint, headers=headers).json()
for region, region_data in response["regions"].items():
if region == "EU (Frankfurt)":
for instance_type, instance_data in region_data.items():
print(json.dumps(instance_data, indent=2))
Sample output for EU (Frankfurt) (shortened for brevity):
{
"rateCode": "X7Z5CZBN2ZY5QED6.JRTCKXETXF.6YS6EN2CT7",
"price": "6.1120000000",
"Instance": "ml.g4dn.12xlarge",
"Clock Speed": "2.5 GHz",
"Instance Type": "ml.g4dn.12xlarge-AsyncInf",
"Component": "AsyncInf",
"VCPU": "48",
"Memory": "192 GiB"
}
{
"rateCode": "F926HEYB3SV5TQ3Y.JRTCKXETXF.6YS6EN2CT7",
"price": "6.8000000000",
"Instance": "ml.g4dn.16xlarge",
"Clock Speed": "2.5 GHz",
"Instance Type": "ml.g4dn.16xlarge-AsyncInf",
"Component": "AsyncInf",
"VCPU": "64",
"Memory": "256 GiB"
}
{
"rateCode": "7SMSS7DTJHR8UWN7.JRTCKXETXF.6YS6EN2CT7",
"price": "1.8810000000",
"Instance": "ml.g4dn.4xlarge",
"Clock Speed": "2.5 GHz",
"Instance Type": "ml.g4dn.4xlarge-AsyncInf",
"Component": "AsyncInf",
"VCPU": "16",
"Memory": "64 GiB"
}
and much more ...

Related

Web scraping through API - Python

I'm trying to web scrape a web site through python.
URL = "https://www.boerse-frankfurt.de/bond/xs0216072230"
With the code below, I am getting no result, it shows this in output : {}
Code is below :
import requests
url = (
"https://api.boerse-frankfurt.de/v1/data/master_data_bond?isin=XS0216072230"
)
headers = {
"X-Client-TraceId": "d87b41992f6161c09e875c525c70ffcf",
"X-Security": "d361b3c92e9c50a248e85a12849f8eee",
"Client-Date": "2022-08-25T09:07:36.196Z",
}
data = requests.get(url, headers=headers).json()
print(data)
It should print :
{
"isin": "XS0216072230",
"type": {
"originalValue": "25",
"translations": {
"de": "(Industrie-) und Bankschuldverschreibungen",
"en": "Industrial and bank bonds",
},
},
"market": {
"originalValue": "OPEN",
"translations": {"de": "Freiverkehr", "en": "Open Market"},
Any help would be appreciated, I am avoiding Selenium approach for this at the moment.
Thanks in advance.
URL must have some data. https://api.boerse-frankfurt.de/v1/data/master_data_bond?isin=XS0216072230 this url is Empty
This works for me
import requests
url = (
"https://api.boerse-frankfurt.de/v1/data/master_data_bond?isin=XS0216072230"
)
header = {
"authority":"api.boerse-frankfurt.de",
"method":"GET",
"path":"/v1/data/master_data_bond?isin=XS0216072230",
"scheme":"https",
"accept":"application/json, text/plain, */*",
"accept-encoding":"gzip, deflate, br",
"accept-language":"en-US,en;q=0.6",
"client-date":"2022-08-26T18:35:26.470Z",
"origin":"https://www.boerse-frankfurt.de",
"referer":"https://www.boerse-frankfurt.de/",
"x-client-traceid":"21eb43fb86f0065542ba9a34b7f2fa93",
"x-security":"14407a81ab4670847d3d55b0d74a3aea",
}
data = requests.get(url, headers=header).json()
print(data)
But I think you might need to update x-client-traceid,client-date, and x-security regularly

Scraping Google Maps with Python and bs4 Without API

I'm triying get data from google maps with python and BeautifulSoup. For example pharmacies in a city. I will get location data (lat-lon), name of pharmacy(ie, MDC Pharmacy), score of pharmcy(3.2), number of rewiews(10), addres with zip code, and phone number of pharmacy.
I have tried python and BeautifulSoup but I'm stuck because I don't know how to extract the data. Class method isn't working. When I prettifing and printing to the results I have seen all of data. So how can I clean them for a pandas data frame? I need more codes both for clean data and add them a list or df. Also classobject turning noobject type. Here my codes:
import requests
from bs4 import BeautifulSoup
r=requests.get("https://www.google.com.tr/maps/search/eczane/#37.4809437,36.7749346,57378m/data=!3m1!1e3")
soup= BeautifulSoup(r.content,"lxml")
a=soup.prettify()
l=soup.find("div",{"class":"mapsConsumerUiSubviewSectionGm2Placeresultcontainer__result-container mapsConsumerUiSubviewSectionGm2Placeresultcontainer__one-action mapsConsumerUiSubviewSectionGm2Placeresultcontainer__wide-margin"})
print(a)
Printresult.jpg
I have this result I need extract data from here (above).
I want a result like this table (below). Thanks...
wanted resul(it is just a sample)
You don't need selenium for this. You don't even need BeautifulSoup (in fact, it doesn't help at all). Here is code that fetches the page, isolates the initialization data JSON, decodes it, and prints the resulting Python structure.
You would need to print out the structure, and start doing some counting to find the data you want, but it's all here.
import requests
import json
from pprint import pprint
r=requests.get("https://www.google.com.tr/maps/search/eczane/#37.4809437,36.7749346,57378m/data=!3m1!1e3")
txt = r.text
find1 = "window.APP_INITIALIZATION_STATE="
find2 = ";window.APP"
i1 = txt.find(find1)
i2 = txt.find(find2, i1+1 )
js = txt[i1+len(find1):i2]
data = json.loads(js)
pprint(data)
It might be also worth looking into a third party solution like SerpApi. It's a paid API with a free trial.
Example python code (available in other libraries also):
from serpapi import GoogleSearch
params = {
"api_key": "secret_api_key",
"engine": "google_maps",
"q": "eczane",
"google_domain": "google.com",
"hl": "en",
"ll": "#37.5393407,36.707705,11z",
"type": "search"
}
search = GoogleSearch(params)
results = search.get_dict()
Example JSON output:
"local_results": [
{
"position": 1,
"title": "Ocak Eczanesi",
"place_id": "ChIJcRipbonnLRUR4DG-UuCnB2I",
"data_id": "0x152de7896ea91871:0x6207a7e052be31e0",
"data_cid": "7063799122456621536",
"reviews_link": "https://serpapi.com/search.json?data_id=0x152de7896ea91871%3A0x6207a7e052be31e0&engine=google_maps_reviews&hl=en",
"photos_link": "https://serpapi.com/search.json?data_id=0x152de7896ea91871%3A0x6207a7e052be31e0&engine=google_maps_photos&hl=en",
"gps_coordinates": {
"latitude": 37.5775156,
"longitude": 36.957789399999996
},
"place_id_search": "https://serpapi.com/search.json?data=%214m5%213m4%211s0x152de7896ea91871%3A0x6207a7e052be31e0%218m2%213d37.5775156%214d36.957789399999996&engine=google_maps&google_domain=google.com&hl=en&type=place",
"rating": 3.5,
"reviews": 8,
"type": "Drug store",
"address": "Kanuni Mh. Milcan Cd. Pk:46100 Merkez, 46100 Dulkadiroğlu/Kahramanmaraş, Turkey",
"open_state": "Closes soon ⋅ 6PM ⋅ Opens 8:30AM Fri",
"hours": "Closing soon: 6:00 PM",
"phone": "+90 344 231 68 00",
"website": "https://kahramanmaras.bel.tr/nobetci-eczaneler",
"thumbnail": "https://lh5.googleusercontent.com/p/AF1QipN5CQRdoKc_BdCgSDiEdi0nEkk1X_VUy1PP4wN3=w93-h92-k-no"
},
{
"position": 2,
"title": "Nobetci eczane",
"place_id": "ChIJP4eh2WndLRURD6IcnOov0dA",
"data_id": "0x152ddd69d9a1873f:0xd0d12fea9c1ca20f",
"data_cid": "15046860514709512719",
"reviews_link": "https://serpapi.com/search.json?data_id=0x152ddd69d9a1873f%3A0xd0d12fea9c1ca20f&engine=google_maps_reviews&hl=en",
"photos_link": "https://serpapi.com/search.json?data_id=0x152ddd69d9a1873f%3A0xd0d12fea9c1ca20f&engine=google_maps_photos&hl=en",
"gps_coordinates": {
"latitude": 37.591462,
"longitude": 36.8847051
},
"place_id_search": "https://serpapi.com/search.json?data=%214m5%213m4%211s0x152ddd69d9a1873f%3A0xd0d12fea9c1ca20f%218m2%213d37.591462%214d36.8847051&engine=google_maps&google_domain=google.com&hl=en&type=place",
"rating": 3.3,
"reviews": 12,
"type": "Pharmacy",
"address": "Mimar Sinan, 48007. Sk. No:19, 46050 Kahramanmaraş Merkez/Kahramanmaraş, Turkey",
"open_state": "Open now",
"thumbnail": "https://lh5.googleusercontent.com/p/AF1QipNznf-hC_y9KdijwUMqdO9YIcn7rbN8ZQpdIHK5=w163-h92-k-no"
},
...
]
Check out the documentation for more details.
Disclaimer: I work at SerpApi.

Python - How to retrieve element from json

Aloha,
My python routine will retrieve json from site, then check the file and download another json given the first answer and eventually download a zip.
The first json file gives information about doc.
Here's an example :
[
{
"id": "d9789918772f935b2d686f523d066a7b",
"originalName": "130010259_AC2_R44_20200101",
"type": "SUP",
"status": "document.deleted",
"legalStatus": "APPROVED",
"name": "130010259_SUP_R44_AC2",
"grid": {
"name": "R44",
"title": "GRAND EST"
},
"bbox": [
3.4212881,
47.6171589,
8.1598899,
50.1338684
],
"documentSource": "UPLOAD",
"uploadDate": "2020-06-25T14:56:27+02:00",
"updateDate": "2021-01-19T14:33:35+01:00",
"fileIdentifier": "SUP-AC2-R44-130010259-20200101",
"legalControlStatus": 101
},
{
"id": "6a9013bdde6acfa632861aeb1a02942b",
"originalName": "130010259_AC2_R44_20210101",
"type": "SUP",
"status": "document.production",
"legalStatus": "APPROVED",
"name": "130010259_SUP_R44_AC2",
"grid": {
"name": "R44",
"title": "GRAND EST"
},
"bbox": [
3.4212881,
47.6171589,
8.1598899,
50.1338684
],
"documentSource": "UPLOAD",
"uploadDate": "2021-01-18T16:37:01+01:00",
"updateDate": "2021-01-19T14:33:29+01:00",
"fileIdentifier": "SUP-AC2-R44-130010259-20210101",
"legalControlStatus": 101
},
{
"id": "efd51feaf35b12248966cb82f603e403",
"originalName": "130010259_PM2_R44_20210101",
"type": "SUP",
"status": "document.production",
"legalStatus": "APPROVED",
"name": "130010259_SUP_R44_PM2",
"grid": {
"name": "R44",
"title": "GRAND EST"
},
"bbox": [
3.6535762,
47.665021,
7.9509455,
49.907347
],
"documentSource": "UPLOAD",
"uploadDate": "2021-01-28T09:52:31+01:00",
"updateDate": "2021-01-28T18:53:34+01:00",
"fileIdentifier": "SUP-PM2-R44-130010259-20210101",
"legalControlStatus": 101
},
{
"id": "2e1b6104fdc09c84077d54fd9e74a7a7",
"originalName": "444619258_I4_R44_20210211",
"type": "SUP",
"status": "document.pre_production",
"legalStatus": "APPROVED",
"name": "444619258_SUP_R44_I4",
"grid": {
"name": "R44",
"title": "GRAND EST"
},
"bbox": [
2.8698336,
47.3373246,
8.0881368,
50.3796449
],
"documentSource": "UPLOAD",
"uploadDate": "2021-04-19T10:20:20+02:00",
"updateDate": "2021-04-19T14:46:21+02:00",
"fileIdentifier": "SUP-I4-R44-444619258-20210211",
"legalControlStatus": 100
}
]
What I try to do is to retrieve "id" from this json file. (ex. "id": "2e1b6104fdc09c84077d54fd9e74a7a7",)
I've tried
import json
from jsonpath_rw import jsonpath, parse
import jsonpath_rw_ext as jp
with open('C:/temp/gpu/SUP/20210419/SUPGE.json') as f:
d = json.load(f)
data = json.dumps(d)
print("oriName: {}".format( jp.match1("$.id[*]",data) ) )
It doesn't work In fact, I'm not sure how jsonpath-rw is intended to work. Thankfully there was this blogpost But I'm still stuck.
Does anyone have a clue ?
With the id, I'll be able to download another json and in this json there'll be an archiveUrl to get the zipfile.
Thanks in advance.
import json
file = open('SUPGE.json')
with file as f:
d = json.load(f)
for i in d:
print(i.get('id'))
this will give you id only.
d9789918772f935b2d686f523d066a7b
6a9013bdde6acfa632861aeb1a02942b
efd51feaf35b12248966cb82f603e403
2e1b6104fdc09c84077d54fd9e74a7a7
Ok.
Here's what I've done.
import json
import urllib
# not sure it's the best way to load json from url, but it works fine
# and I could test most of code if needed.
def getResponse(url):
operUrl = urllib.request.urlopen(url)
if(operUrl.getcode()==200):
data = operUrl.read()
jsonData = json.loads(data)
else:
print("Erreur reçue", operUrl.getcode())
return jsonData
# Here I get the json from the url. *
# That part will be in the final script a parameter,
# because I got lot of territory to control
d = getResponse('https://www.geoportail-urbanisme.gouv.fr/api/document?documentFamily=SUP&grid=R44&legalStatus=APPROVED')
for i in d:
if i['status'] == 'document.production' :
print('id du doc en production :',i.get('id'))
# here we parse the id to fetch the whole document.
# Same server, same API but different url
_URL = 'https://www.geoportail-urbanisme.gouv.fr/api/document/' + i.get('id')+'/details'
d2 = getResponse(_URL)
print('archive',d2['archiveUrl'])
urllib.request.urlretrieve(d2['archiveUrl'], 'c:/temp/gpu/SUP/'+d2['metadata']+'.zip' )
# I used wget in the past and loved the progression bar.
# Maybe I'd switch to wget because of it.
# Works fine.
Thanks for your answer. I'm delighted to see that even with only the json library you could do amazing things. Just normal stuff. But amazing.
Feel free to comment if you think I've missed smthg.

Extract rating from google search results

I am trying to extract google search results using google api in python.I am able to extract url, link, title and snippet. But i also want to extract the rating that is displayed in the google search results.
Below is the code i am using:
#Google Search Function
def google_search(search_term, api_key, cse_id, **kwargs):
service = build("customsearch", "v1", developerKey=api_key)
res = service.cse().list(q=search_term, cx=cse_id,start = 1,hq ='company reviews', **kwargs).execute()
return res['items']
results = google_search('Swiggy', my_api_key, my_cse_id, num=10)
print(results[2]["title"])
print(results[2]["link"])
print(results[2]["displayLink"])
print(results[2]["snippet"])
I can see the first search result, on searching "swiggy company review" on google, shows rating of 3.7 but i don't know how to extract that information.Can anyone please suggest any solution?
Thanks in advance
Since Google API has been deprecated, it could be easily done scraping it using BeautifulSoup CCS selector select() (for multiple elements) / select_one() (for specific element) methods amoung other techniques.
Code and full example:
from bs4 import BeautifulSoup
import requests, lxml, json
headers = {
"User-Agent":
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
response = requests.get(
'https://www.google.com/search?q=swiggy company review',
headers=headers).text
soup = BeautifulSoup(response, 'lxml')
# Selects just one Review element (using converted xPath to CSS selector):
# review = soup.select_one('#rso > div:nth-of-type(1) > div > div > div:nth-of-type(2) > div > span:nth-of-type(1)').text
# print(review)
# Selects just one Vote element (using converted xPath to CSS selector):
# votes = soup.select_one('#rso > div:nth-of-type(1) > div > div > div:nth-of-type(2) > div > span:nth-of-type(2)').text
# print(votes)
data = []
# Selects multiple Vote elements:
for something in soup.select('.uo4vr'):
rating = something.select_one('.uo4vr g-review-stars+ span').text.split(':')[1].strip()
votes_reviews = something.select_one('.uo4vr span+ span').text.split(' ')[0]
data.append({
"Rating": rating,
"Votes/Reviews": votes_reviews,
})
print(json.dumps(data, indent=2))
Output:
[
{
"Rating": "4",
"Votes/Reviews": "1,219"
},
{
"Rating": "4",
"Votes/Reviews": "1,090"
},
{
"Rating": "3.8",
"Votes/Reviews": "46"
},
{
"Rating": "3.8",
"Votes/Reviews": "260"
},
{
"Rating": "4.1",
"Votes/Reviews": "1,047"
},
{
"Rating": "3.3",
"Votes/Reviews": "47"
},
{
"Rating": "1.5",
"Votes/Reviews": "114"
}
]
Alternatively, you can use Google Organic Results API from SerpApi. It's a paid API with a free trial.
Code to integrate:
from serpapi import GoogleSearch
import os, json
params = {
"engine": "google",
"q": "swiggy company review",
"api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
# For extracting single elements:
# rating = results['organic_results'][0]['rich_snippet']['top']['detected_extensions']['rating']
# print(f"Rating: {rating}")
# votes = results['organic_results'][0]['rich_snippet']['top']['detected_extensions']['votes']
# print(f"Votes: {votes}")
# For extracing multiple elements:
data = []
for organic_result in results['organic_results']:
title = organic_result['title']
try:
rating = organic_result['rich_snippet']['top']['detected_extensions']['rating']
except:
rating = None
try:
votes = organic_result['rich_snippet']['top']['detected_extensions']['votes']
except:
votes = None
try:
reviews = organic_result['rich_snippet']['top']['detected_extensions']['reviews']
except:
reviews = None
data.append({
"Title": title,
"Rating": rating,
"Votes": votes,
"Reviews": reviews,
})
print(json.dumps(data, indent=2))
Output:
[
{
"Title": "Swiggy Reviews | Glassdoor",
"Rating": 4,
"Votes": 1219,
"Reviews": null
},
{
"Title": "Ride.Swiggy: 254 Employee Reviews | Indeed.com",
"Rating": null,
"Votes": null,
"Reviews": null
}
{
"Title": "Working at Swiggy | Glassdoor",
"Rating": 4,
"Votes": 1090,
"Reviews": null
}
]
Disclaimer, I work for SerpApi.

Simple Python social media scrape of Public information

I just want to grab public information from my accounts on two social media sites. (Instagram and Twitter) My code returns info for twitter, and I know the xpath is correct for instagram but for some reason i'm not getting data for it. I know the XPATH's could be more specific but I can fix that later. Both my accounts are public.
1) I thought maybe it didn't like the python header, so I tried changing it and I still get nothing. That line is commented out but its still there.
2) I heard something about an API on github, this lengthy code is very intimidating and way above my level of understanding. I don't know more than half of what i'm reading on there.
from lxml import html
import requests
import webbrowser
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
#page = requests.get('https://www.instagram.com/<my account>/', headers=headers)
page = requests.get('https://www.instagram.com/<my account>/')
tree = html.fromstring(page.text)
pageTwo = requests.get('http://www.twitter.com/<my account>')
treeTwo = html.fromstring(pageTwo.text)
instaFollowers = tree.xpath("//span[#data-reactid='.0.1.0.0:0.1.3.1.0']/span[2]/text()")
instaFollowing = tree.xpath("//span[#data-reactid='.0.1.0.0:0.1.3.2.0']/span[2]/text()")
twitFollowers = treeTwo.xpath("//a[#data-nav='followers']/span[#class='ProfileNav-value']/text()")
twitFollowing = treeTwo.xpath("//a[#data-nav='following']/span[#class='ProfileNav-value']/text()")
print ''
print '--------------------'
print 'Social Media Checker'
print '--------------------'
print ''
print 'Instagram: ' + str(instaFollowers) + ' / ' + str(instaFollowing)
print ''
print 'Twitter: ' + str(twitFollowers) + ' / ' + str(twitFollowing)
As mentioned, Instragram's page source does not reflect its rendered source as a Javascript function is called to pass content from JSON data to browser. Hence, what Python scrapes in page source does not show exactly what browser renders to screen. Welcome to the new world of dynamic web programming! Consider using Instagram's API or other web parser that can retrieve html generated content (not just page source).
With that said, if you simply need the IG account data you can still use Python's lxml to XPath the JSON content in <script> tag (specifically sixth occurrence but adjust to your needed page). Below example parses Google's Instagram JSON data:
import lxml.etree as et
import urllib.request as rq
rqpage = rq.urlopen('https://instagram.com/google')
txtpage = rqpage.read()
tree = et.HTML(txtpage)
jsondata = tree.xpath("//script[#type='text/javascript' and position()=6]/text()")
for i in jsondata:
print(i)
OUTPUT
window._sharedData = {"qs":"{\"shift\":10,\"header
\":\"n3bTdmHGHDgxvZYPN0KDFHqbkxd6zpTl\",\"edges\":100,\"blob
\":\"AQCq42rOTCnKOZcOxFn06L1J6_W8wY6ntAS1bX88VBClAjQD9PyJdefCzOwfSAbUdsBwHKb1QSndurPtjyN-
rHMOrZ_6ubE_Xpu908cyron9Zczkj4QMkAYUHIgnmmftuXG8rrFzq_Oq3BoXpQgovI9hefha-
6SAs1RLJMwMArrbMlFMLAwyd1TZhArcxQkk9bgRGT4MZK4Tk2VNt1YOKDN1pO3NJneFlUxdUJTdDX
zj3eY-stT7DnxF_GM_j6xwk1o\",\"iterations\":7,\"size\":42}","static_root":"
\/\/instagramstatic-a.akamaihd.net\/bluebar\/5829dff","entry_data":
{"ProfilePage":[{"__query_string":"?","__path":"\/google\/","__get_params":
{},"user":{"username":"google","has_blocked_viewer":false,"follows":
{"count":10},"requested_by_viewer":false,"followed_by":
{"count":977186},"country_block":null,"has_requested_viewer":false,"followed_
by_viewer":false,"follows_viewer":false,"profile_pic_url":"https:
\/\/instagram.ford1-1.fna.fbcdn.net\/hphotos-xfp1\/t51.2885-19\/s150x150
\/11910217_933356470069152_115044571_a.jpg","is_private":false,"full_name":
"Google","media":{"count":180,"page_info":
{"has_previous_page":false,"start_cursor":"1126896719808871555","end_cursor":
"1092117490206686720","has_next_page":true},"nodes":[{"code":"-
jipiawryD","dimensions":{"width":640,"height":640},"owner":
{"id":"1067259270"},"comments":{"count":105},"caption":"Today's the day!
Your searches are served. Happy Thanksgiving \ud83c\udf57\ud83c\udf70
#GoogleTrends","likes":
{"count":11410},"date":1448556579.0,"thumbnail_src":"https:\/
\/instagram.ford1-1.fna.fbcdn.net\/hphotos-xat1\/t51.2885-15\/e15\
/11848856_482502108621097_589421586_n.jpg","is_video":true,"id":"112689671980
8871555","display_src":"https:\/\/instagram.ford1-1.fna.fbcdn.net\/hphotos-
xat1\/t51.2885-15
...
JSON Pretty Print (extracting the window._sharedData variable from above)
See below where user (followers, following, etc.) data shows at beginning:
{
"qs": "{\"shift\":10,\"header\":\"n3bTdmHGHDgxvZYPN0KDFHqbkxd6zpTl\",\"edges\":100,\"blob\":\"AQCq42rOTCnKOZcOxFn06L1J6_W8wY6ntAS1bX88VBClAjQD9PyJdefCzOwfSAbUdsBwHKb1QSndurPtjyN-rHMOrZ_6ubE_Xpu908cyron9Zczkj4QMkAYUHIgnmmftuXG8rrFzq_Oq3BoXpQgovI9hefha-6SAs1RLJMwMArrbMlFMLAwyd1TZhArcxQkk9bgRGT4MZK4Tk2VNt1YOKDN1pO3NJneFlUxdUJTdDXzj3eY-stT7DnxF_GM_j6xwk1o\",\"iterations\":7,\"size\":42}",
"static_root": "\/\/instagramstatic-a.akamaihd.net\/bluebar\/5829dff",
"entry_data": {
"ProfilePage": [
{
"__query_string": "?",
"__path": "\/google\/",
"__get_params": {
},
"user": {
"username": "google",
"has_blocked_viewer": false,
"follows": {
"count": 10
},
"requested_by_viewer": false,
"followed_by": {
"count": 977186
},
"country_block": null,
"has_requested_viewer": false,
"followed_by_viewer": false,
"follows_viewer": false,
"profile_pic_url": "https:\/\/instagram.ford1-1.fna.fbcdn.net\/hphotos-xfp1\/t51.2885-19\/s150x150\/11910217_933356470069152_115044571_a.jpg",
"is_private": false,
"full_name": "Google",
"media": {
"count": 180,
"page_info": {
"has_previous_page": false,
"start_cursor": "1126896719808871555",
"end_cursor": "1092117490206686720",
"has_next_page": true
},
"nodes": [
{
"code": "-jipiawryD",
"dimensions": {
"width": 640,
"height": 640
},
"owner": {
"id": "1067259270"
},
"comments": {
"count": 105
},
"caption": "Today's the day! Your searches are served. Happy Thanksgiving \ud83c\udf57\ud83c\udf70 #GoogleTrends",
"likes": {
"count": 11410
},
"date": 1448556579,
"thumbnail_src": "https:\/\/instagram.ford1-1.fna.fbcdn.net\/hphotos-xat1\/t51.2885-15\/e15\/11848856_482502108621097_589421586_n.jpg",
"is_video": true,
"id": "1126896719808871555",
"display_src": "https:\/\/instagram.ford1-1.fna.fbcdn.net\/hphotos-xat1\/t51.2885-15\/e15\/11848856_482502108621097_589421586_n.jpg"
},
{
"code": "-hwbf2wr0O",
"dimensions": {
"width": 640,
"height": 640
},
"owner": {
"id": "1067259270"
},
"comments": {
"count": 95
},
"caption": "Thanksgiving dinner is waiting. But first, the airport. \u2708\ufe0f #GoogleApp",
"likes": {
"count": 12621
},
...
IF anyone is interested in this sort of thing still, using selenium solved my problems.
http://pastebin.com/5eHeDt3r
Is there a faster way ?
In case you want to find information about yourself and others without hassling with code, try this piece of software. Apart from automatic scraping, it analyzes and visualizes the received information on a PDF report from such social networks: Facebook, Twitter, Instagram and from the Google Search engine.
P.S. I am the main developer and maintainer of this project.

Categories