Simple Python social media scrape of public information - python

I just want to grab public information from my accounts on two social media sites (Instagram and Twitter). My code returns info for Twitter, and I know the XPath is correct for Instagram, but for some reason I'm not getting data for it. I know the XPaths could be more specific, but I can fix that later. Both my accounts are public.
1) I thought maybe it didn't like the Python header, so I tried changing it, and I still get nothing. That line is commented out, but it's still there.
2) I heard something about an API on GitHub, but that lengthy code is very intimidating and way above my level of understanding. I don't understand more than half of what I'm reading there.
from lxml import html
import requests
import webbrowser
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
#page = requests.get('https://www.instagram.com/<my account>/', headers=headers)
page = requests.get('https://www.instagram.com/<my account>/')
tree = html.fromstring(page.text)
pageTwo = requests.get('http://www.twitter.com/<my account>')
treeTwo = html.fromstring(pageTwo.text)
instaFollowers = tree.xpath("//span[@data-reactid='.0.1.0.0:0.1.3.1.0']/span[2]/text()")
instaFollowing = tree.xpath("//span[@data-reactid='.0.1.0.0:0.1.3.2.0']/span[2]/text()")
twitFollowers = treeTwo.xpath("//a[@data-nav='followers']/span[@class='ProfileNav-value']/text()")
twitFollowing = treeTwo.xpath("//a[@data-nav='following']/span[@class='ProfileNav-value']/text()")
print ''
print '--------------------'
print 'Social Media Checker'
print '--------------------'
print ''
print 'Instagram: ' + str(instaFollowers) + ' / ' + str(instaFollowing)
print ''
print 'Twitter: ' + str(twitFollowers) + ' / ' + str(twitFollowing)

As mentioned, Instagram's page source does not reflect its rendered content, because a JavaScript function passes content from JSON data to the browser. Hence, what Python scrapes from the page source does not show exactly what the browser renders to the screen. Welcome to the new world of dynamic web programming! Consider using Instagram's API or another parser that can retrieve JavaScript-generated content (not just the page source).
With that said, if you simply need the IG account data, you can still use Python's lxml to XPath the JSON content inside a <script> tag (specifically the sixth occurrence here, but adjust to the page you need). The example below parses Google's Instagram JSON data:
import lxml.etree as et
import urllib.request as rq
rqpage = rq.urlopen('https://instagram.com/google')
txtpage = rqpage.read()
tree = et.HTML(txtpage)
jsondata = tree.xpath("//script[@type='text/javascript' and position()=6]/text()")
for i in jsondata:
    print(i)
OUTPUT
window._sharedData = {"qs":"{\"shift\":10,\"header
\":\"n3bTdmHGHDgxvZYPN0KDFHqbkxd6zpTl\",\"edges\":100,\"blob
\":\"AQCq42rOTCnKOZcOxFn06L1J6_W8wY6ntAS1bX88VBClAjQD9PyJdefCzOwfSAbUdsBwHKb1QSndurPtjyN-
rHMOrZ_6ubE_Xpu908cyron9Zczkj4QMkAYUHIgnmmftuXG8rrFzq_Oq3BoXpQgovI9hefha-
6SAs1RLJMwMArrbMlFMLAwyd1TZhArcxQkk9bgRGT4MZK4Tk2VNt1YOKDN1pO3NJneFlUxdUJTdDX
zj3eY-stT7DnxF_GM_j6xwk1o\",\"iterations\":7,\"size\":42}","static_root":"
\/\/instagramstatic-a.akamaihd.net\/bluebar\/5829dff","entry_data":
{"ProfilePage":[{"__query_string":"?","__path":"\/google\/","__get_params":
{},"user":{"username":"google","has_blocked_viewer":false,"follows":
{"count":10},"requested_by_viewer":false,"followed_by":
{"count":977186},"country_block":null,"has_requested_viewer":false,"followed_
by_viewer":false,"follows_viewer":false,"profile_pic_url":"https:
\/\/instagram.ford1-1.fna.fbcdn.net\/hphotos-xfp1\/t51.2885-19\/s150x150
\/11910217_933356470069152_115044571_a.jpg","is_private":false,"full_name":
"Google","media":{"count":180,"page_info":
{"has_previous_page":false,"start_cursor":"1126896719808871555","end_cursor":
"1092117490206686720","has_next_page":true},"nodes":[{"code":"-
jipiawryD","dimensions":{"width":640,"height":640},"owner":
{"id":"1067259270"},"comments":{"count":105},"caption":"Today's the day!
Your searches are served. Happy Thanksgiving \ud83c\udf57\ud83c\udf70
#GoogleTrends","likes":
{"count":11410},"date":1448556579.0,"thumbnail_src":"https:\/
\/instagram.ford1-1.fna.fbcdn.net\/hphotos-xat1\/t51.2885-15\/e15\
/11848856_482502108621097_589421586_n.jpg","is_video":true,"id":"112689671980
8871555","display_src":"https:\/\/instagram.ford1-1.fna.fbcdn.net\/hphotos-
xat1\/t51.2885-15
...
JSON Pretty Print (extracting the window._sharedData variable from above)
See below where the user data (followers, following, etc.) shows near the beginning:
{
"qs": "{\"shift\":10,\"header\":\"n3bTdmHGHDgxvZYPN0KDFHqbkxd6zpTl\",\"edges\":100,\"blob\":\"AQCq42rOTCnKOZcOxFn06L1J6_W8wY6ntAS1bX88VBClAjQD9PyJdefCzOwfSAbUdsBwHKb1QSndurPtjyN-rHMOrZ_6ubE_Xpu908cyron9Zczkj4QMkAYUHIgnmmftuXG8rrFzq_Oq3BoXpQgovI9hefha-6SAs1RLJMwMArrbMlFMLAwyd1TZhArcxQkk9bgRGT4MZK4Tk2VNt1YOKDN1pO3NJneFlUxdUJTdDXzj3eY-stT7DnxF_GM_j6xwk1o\",\"iterations\":7,\"size\":42}",
"static_root": "\/\/instagramstatic-a.akamaihd.net\/bluebar\/5829dff",
"entry_data": {
"ProfilePage": [
{
"__query_string": "?",
"__path": "\/google\/",
"__get_params": {
},
"user": {
"username": "google",
"has_blocked_viewer": false,
"follows": {
"count": 10
},
"requested_by_viewer": false,
"followed_by": {
"count": 977186
},
"country_block": null,
"has_requested_viewer": false,
"followed_by_viewer": false,
"follows_viewer": false,
"profile_pic_url": "https:\/\/instagram.ford1-1.fna.fbcdn.net\/hphotos-xfp1\/t51.2885-19\/s150x150\/11910217_933356470069152_115044571_a.jpg",
"is_private": false,
"full_name": "Google",
"media": {
"count": 180,
"page_info": {
"has_previous_page": false,
"start_cursor": "1126896719808871555",
"end_cursor": "1092117490206686720",
"has_next_page": true
},
"nodes": [
{
"code": "-jipiawryD",
"dimensions": {
"width": 640,
"height": 640
},
"owner": {
"id": "1067259270"
},
"comments": {
"count": 105
},
"caption": "Today's the day! Your searches are served. Happy Thanksgiving \ud83c\udf57\ud83c\udf70 #GoogleTrends",
"likes": {
"count": 11410
},
"date": 1448556579,
"thumbnail_src": "https:\/\/instagram.ford1-1.fna.fbcdn.net\/hphotos-xat1\/t51.2885-15\/e15\/11848856_482502108621097_589421586_n.jpg",
"is_video": true,
"id": "1126896719808871555",
"display_src": "https:\/\/instagram.ford1-1.fna.fbcdn.net\/hphotos-xat1\/t51.2885-15\/e15\/11848856_482502108621097_589421586_n.jpg"
},
{
"code": "-hwbf2wr0O",
"dimensions": {
"width": 640,
"height": 640
},
"owner": {
"id": "1067259270"
},
"comments": {
"count": 95
},
"caption": "Thanksgiving dinner is waiting. But first, the airport. \u2708\ufe0f #GoogleApp",
"likes": {
"count": 12621
},
...
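Once that script text is in hand, it can be converted into a regular Python dictionary instead of being read by eye. A minimal sketch, assuming the window._sharedData assignment and the sixth <script> position shown above still hold (Instagram changes its markup regularly, so both are fragile):

import json
import lxml.etree as et
import urllib.request as rq

page = rq.urlopen('https://instagram.com/google').read()
tree = et.HTML(page)

# the <script> that assigns window._sharedData (position may vary between pages)
script_text = tree.xpath("//script[@type='text/javascript' and position()=6]/text()")[0]

# drop the "window._sharedData = " prefix and the trailing ";" to leave pure JSON
json_text = script_text.partition('=')[2].strip().rstrip(';')
shared_data = json.loads(json_text)

user = shared_data['entry_data']['ProfilePage'][0]['user']
print(user['followed_by']['count'], '/', user['follows']['count'])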

If anyone is still interested in this sort of thing, using Selenium solved my problems.
http://pastebin.com/5eHeDt3r
Is there a faster way?
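For anyone landing here without the pastebin, a rough Selenium sketch of the same idea (the XPath expressions below are placeholders only; Instagram's rendered markup changes often, so they would need to be checked against the live page):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # needs a matching chromedriver installed
driver.get('https://www.instagram.com/<my account>/')

# Selenium sees the rendered DOM, so the JavaScript-inserted counters exist here
# (unlike in the raw page source fetched by requests)
followers = driver.find_element(By.XPATH, "//a[contains(@href, '/followers')]/span").text
following = driver.find_element(By.XPATH, "//a[contains(@href, '/following')]/span").text
print('Instagram: ' + followers + ' / ' + following)

driver.quit()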

In case you want to find information about yourself and others without hassling with code, try this piece of software. Apart from automatic scraping, it analyzes and visualizes the received information in a PDF report covering the following social networks: Facebook, Twitter, Instagram, and the Google Search engine.
P.S. I am the main developer and maintainer of this project.

Related

Web scraping through API - Python

I'm trying to scrape a website with Python.
URL = "https://www.boerse-frankfurt.de/bond/xs0216072230"
With the code below I am getting no result; the output shows just {}.
The code is below:
import requests
url = (
"https://api.boerse-frankfurt.de/v1/data/master_data_bond?isin=XS0216072230"
)
headers = {
"X-Client-TraceId": "d87b41992f6161c09e875c525c70ffcf",
"X-Security": "d361b3c92e9c50a248e85a12849f8eee",
"Client-Date": "2022-08-25T09:07:36.196Z",
}
data = requests.get(url, headers=headers).json()
print(data)
It should print:
{
"isin": "XS0216072230",
"type": {
"originalValue": "25",
"translations": {
"de": "(Industrie-) und Bankschuldverschreibungen",
"en": "Industrial and bank bonds",
},
},
"market": {
"originalValue": "OPEN",
"translations": {"de": "Freiverkehr", "en": "Open Market"},
Any help would be appreciated; I am avoiding a Selenium approach for this at the moment.
Thanks in advance.
The URL must have some data; https://api.boerse-frankfurt.de/v1/data/master_data_bond?isin=XS0216072230 comes back empty when called as above.
This works for me
import requests
url = (
"https://api.boerse-frankfurt.de/v1/data/master_data_bond?isin=XS0216072230"
)
header = {
"authority":"api.boerse-frankfurt.de",
"method":"GET",
"path":"/v1/data/master_data_bond?isin=XS0216072230",
"scheme":"https",
"accept":"application/json, text/plain, */*",
"accept-encoding":"gzip, deflate, br",
"accept-language":"en-US,en;q=0.6",
"client-date":"2022-08-26T18:35:26.470Z",
"origin":"https://www.boerse-frankfurt.de",
"referer":"https://www.boerse-frankfurt.de/",
"x-client-traceid":"21eb43fb86f0065542ba9a34b7f2fa93",
"x-security":"14407a81ab4670847d3d55b0d74a3aea",
}
data = requests.get(url, headers=header).json()
print(data)
But I think you might need to update x-client-traceid, client-date, and x-security regularly.
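At least the client-date header can be regenerated locally; a sketch assuming the same traceid/x-security values captured from the browser's developer tools (those two are produced by the site's JavaScript and will likely expire):

from datetime import datetime, timezone
import requests

url = "https://api.boerse-frankfurt.de/v1/data/master_data_bond?isin=XS0216072230"

# ISO 8601 with milliseconds and a trailing "Z", matching the format the site sends
now = datetime.now(timezone.utc)
client_date = now.strftime("%Y-%m-%dT%H:%M:%S.") + f"{now.microsecond // 1000:03d}Z"

headers = {
    "client-date": client_date,
    # captured from browser devtools; refresh them when requests start failing
    "x-client-traceid": "21eb43fb86f0065542ba9a34b7f2fa93",
    "x-security": "14407a81ab4670847d3d55b0d74a3aea",
}
print(requests.get(url, headers=headers).json())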

Reading key values from a JSON array which is a set in Python

I have the following code
import requests
import json
import sys
credentials_User=sys.argv[1]
credentials_Password=sys.argv[2]
email=sys.argv[3]
def auth_api(login_User, login_Password):
    gooddata_user = login_User
    gooddata_password = login_Password
    body = json.dumps({
        "postUserLogin": {
            "login": gooddata_user,
            "password": gooddata_password,
            "remember": 1,
            "verify_level": 0
        }
    })
    headers = {
        'Content-Type': 'application/json',
        'Accept': 'application/json'
    }
    url = "https://reports.domain.com/gdc/account/login"
    response = requests.request(
        "POST",
        url,
        headers=headers,
        data=body
    )
    sst = response.headers.get('Set-Cookie')
    return sst

def query_api(cookie, email):
    url = "https://reports.domain.com/gdc/account/domains/domain/users?login=" + email
    body = {}
    headers = {
        'Content-Type': 'application/json',
        'Accept': 'application/json',
        'Cookie': cookie
    }
    response = requests.request(
        "GET",
        url,
        headers=headers,
        data=body
    )
    jsonContent = []
    jsonContent.append({response.text})
    accountSettings = jsonContent[0]
    print(accountSettings)
cookie=auth_api(credentials_User,credentials_Password)
profilehash=query_api(cookie,email)
The code itself works and sends a request to the GoodData API.
The query_api() function returns JSON similar to the following:
{
"accountSettings": {
"items": [
{
"accountSetting": {
"login": "user#example.com",
"email": "user#example.com",
"firstName": "First Name",
"lastName": "Last Name",
"companyName": "Company Name",
"position": "Data Analyst",
"created": "2020-01-08 15:44:23",
"updated": "2020-01-08 15:44:23",
"timezone": null,
"country": "United States",
"phoneNumber": "(425) 555-1111",
"old_password": "secret$123",
"password": "secret$234",
"verifyPassword": "secret$234",
"authenticationModes": [
"SSO"
],
"ssoProvider": "sso-domain.com",
"language": "en-US",
"ipWhitelist": [
"127.0.0.1"
],
"links": {
"projects": "/gdc/account/profile/{profile_id}/projects",
"self": "/gdc/account/profile/{profile_id}",
"domain": "/gdc/domains/default",
"auditEvents": "/gdc/account/profile/{profile_id}/auditEvents"
},
"effectiveIpWhitelist": "[ 127.0.0.1 ]"
}
}
],
"paging": {
"offset": 20,
"count": 100,
"next": "/gdc/uri?offset=100"
}
}
}
The issue I am having is reading specific keys from this JSON dict. I can use accountSettings=jsonContent[0], but that just returns the same JSON.
What I want to do is read the value of the projects key within links.
How would I do this with a dict?
Thanks
Based on your description, you have your value inside a list (not a set; forget about sets: they are not used with JSON). Inside your list, you either have your content as a single string, which you'd then have to parse with json.loads, or it is simply a well-behaved nested data structure already extracted from JSON but sitting inside a single-element list. The latter seems most likely.
So, you should be able to do:
accountlink = jsonContent[0]["accountSettings"]["items"][0]["accountSetting"]["login"]
otherwise, if it is encoded as a JSON string, you have to parse it first:
import json
accountlink = json.loads(jsonContent[0])["accountSettings"]["items"][0]["accountSetting"]["login"]
Now, given your question, I'd say you are at a beginner level as a programmer, or a casual user just using Python to automate something. Either way, I'd recommend you try some exercises before proceeding: it will save you time (a lot of time). I am not trying to bully or mock anything here: this is the best advice I can offer you. Look for tutorials that play around in the interactive mode, rather than trying entire programs at once that you'd just copy and paste.
Using the below code fixed the issue
jsonContent=json.loads(response.text)
print(type(jsonContent))
test=jsonContent["accountSettings"]["items"][0]
test2=test["accountSetting"]["links"]["self"]
print(test)
print(test2)
I believe this works because, for some reason, I didn't notice I was using .append on my jsonContent. That resulted in the data type being something other than it should have been.
Thanks to everyone who tried helping me.
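For completeness, a sketch of how query_api could return the parsed dictionary directly (same endpoint and headers as in the question; response.json() does the json.loads step, so no list or set is involved):

def query_api(cookie, email):
    url = "https://reports.domain.com/gdc/account/domains/domain/users?login=" + email
    headers = {
        'Content-Type': 'application/json',
        'Accept': 'application/json',
        'Cookie': cookie
    }
    response = requests.get(url, headers=headers)
    account = response.json()  # already a dict, no manual parsing needed
    # e.g. the "self" link of the first account setting
    return account["accountSettings"]["items"][0]["accountSetting"]["links"]["self"]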

Scraping Google Maps with Python and bs4 Without API

I'm trying to get data from Google Maps with Python and BeautifulSoup, for example pharmacies in a city. I want the location (lat-lon), the pharmacy name (e.g. MDC Pharmacy), its score (3.2), the number of reviews (10), the address with zip code, and the phone number.
I have tried Python and BeautifulSoup, but I'm stuck because I don't know how to extract the data. Searching by class isn't working. When I prettify and print the results I can see all of the data, so how can I clean it up for a pandas data frame? I need more code both to clean the data and to add it to a list or DataFrame. Also, the class object is returning NoneType. Here is my code:
import requests
from bs4 import BeautifulSoup
r=requests.get("https://www.google.com.tr/maps/search/eczane/@37.4809437,36.7749346,57378m/data=!3m1!1e3")
soup= BeautifulSoup(r.content,"lxml")
a=soup.prettify()
l=soup.find("div",{"class":"mapsConsumerUiSubviewSectionGm2Placeresultcontainer__result-container mapsConsumerUiSubviewSectionGm2Placeresultcontainer__one-action mapsConsumerUiSubviewSectionGm2Placeresultcontainer__wide-margin"})
print(a)
The prettified output (screenshot Printresult.jpg omitted) shows the data I need to extract.
I want the result as a table like the sample image (omitted). Thanks...
You don't need selenium for this. You don't even need BeautifulSoup (in fact, it doesn't help at all). Here is code that fetches the page, isolates the initialization data JSON, decodes it, and prints the resulting Python structure.
You would need to print out the structure, and start doing some counting to find the data you want, but it's all here.
import requests
import json
from pprint import pprint
r=requests.get("https://www.google.com.tr/maps/search/eczane/@37.4809437,36.7749346,57378m/data=!3m1!1e3")
txt = r.text
find1 = "window.APP_INITIALIZATION_STATE="
find2 = ";window.APP"
i1 = txt.find(find1)
i2 = txt.find(find2, i1+1 )
js = txt[i1+len(find1):i2]
data = json.loads(js)
pprint(data)
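To avoid doing that counting entirely by hand, a small recursive helper can report the index paths where interesting strings (for example the pharmacy keyword "eczane") occur in the decoded structure. This is a hypothetical exploration aid layered on top of the data variable above, not part of the original answer:

def walk_paths(obj, target, path=()):
    """Yield the index path to every string containing `target` in nested lists/dicts."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from walk_paths(value, target, path + (key,))
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            yield from walk_paths(value, target, path + (i,))
    elif isinstance(obj, str) and target.lower() in obj.lower():
        yield path

# print the first few paths where a pharmacy entry appears
for p in list(walk_paths(data, "eczane"))[:5]:
    print(p)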
It might also be worth looking into a third-party solution like SerpApi. It's a paid API with a free trial.
Example Python code (available in other libraries also):
from serpapi import GoogleSearch
params = {
"api_key": "secret_api_key",
"engine": "google_maps",
"q": "eczane",
"google_domain": "google.com",
"hl": "en",
"ll": "#37.5393407,36.707705,11z",
"type": "search"
}
search = GoogleSearch(params)
results = search.get_dict()
Example JSON output:
"local_results": [
{
"position": 1,
"title": "Ocak Eczanesi",
"place_id": "ChIJcRipbonnLRUR4DG-UuCnB2I",
"data_id": "0x152de7896ea91871:0x6207a7e052be31e0",
"data_cid": "7063799122456621536",
"reviews_link": "https://serpapi.com/search.json?data_id=0x152de7896ea91871%3A0x6207a7e052be31e0&engine=google_maps_reviews&hl=en",
"photos_link": "https://serpapi.com/search.json?data_id=0x152de7896ea91871%3A0x6207a7e052be31e0&engine=google_maps_photos&hl=en",
"gps_coordinates": {
"latitude": 37.5775156,
"longitude": 36.957789399999996
},
"place_id_search": "https://serpapi.com/search.json?data=%214m5%213m4%211s0x152de7896ea91871%3A0x6207a7e052be31e0%218m2%213d37.5775156%214d36.957789399999996&engine=google_maps&google_domain=google.com&hl=en&type=place",
"rating": 3.5,
"reviews": 8,
"type": "Drug store",
"address": "Kanuni Mh. Milcan Cd. Pk:46100 Merkez, 46100 Dulkadiroğlu/Kahramanmaraş, Turkey",
"open_state": "Closes soon ⋅ 6PM ⋅ Opens 8:30AM Fri",
"hours": "Closing soon: 6:00 PM",
"phone": "+90 344 231 68 00",
"website": "https://kahramanmaras.bel.tr/nobetci-eczaneler",
"thumbnail": "https://lh5.googleusercontent.com/p/AF1QipN5CQRdoKc_BdCgSDiEdi0nEkk1X_VUy1PP4wN3=w93-h92-k-no"
},
{
"position": 2,
"title": "Nobetci eczane",
"place_id": "ChIJP4eh2WndLRURD6IcnOov0dA",
"data_id": "0x152ddd69d9a1873f:0xd0d12fea9c1ca20f",
"data_cid": "15046860514709512719",
"reviews_link": "https://serpapi.com/search.json?data_id=0x152ddd69d9a1873f%3A0xd0d12fea9c1ca20f&engine=google_maps_reviews&hl=en",
"photos_link": "https://serpapi.com/search.json?data_id=0x152ddd69d9a1873f%3A0xd0d12fea9c1ca20f&engine=google_maps_photos&hl=en",
"gps_coordinates": {
"latitude": 37.591462,
"longitude": 36.8847051
},
"place_id_search": "https://serpapi.com/search.json?data=%214m5%213m4%211s0x152ddd69d9a1873f%3A0xd0d12fea9c1ca20f%218m2%213d37.591462%214d36.8847051&engine=google_maps&google_domain=google.com&hl=en&type=place",
"rating": 3.3,
"reviews": 12,
"type": "Pharmacy",
"address": "Mimar Sinan, 48007. Sk. No:19, 46050 Kahramanmaraş Merkez/Kahramanmaraş, Turkey",
"open_state": "Open now",
"thumbnail": "https://lh5.googleusercontent.com/p/AF1QipNznf-hC_y9KdijwUMqdO9YIcn7rbN8ZQpdIHK5=w163-h92-k-no"
},
...
]
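Since the question ultimately wants a pandas DataFrame, the local_results list above can be flattened straight into one. A minimal sketch, assuming results is the dictionary returned by search.get_dict() and the fields match the JSON shown:

import pandas as pd

rows = []
for place in results["local_results"]:
    rows.append({
        "title": place.get("title"),
        "rating": place.get("rating"),
        "reviews": place.get("reviews"),
        "address": place.get("address"),
        "phone": place.get("phone"),
        "latitude": place.get("gps_coordinates", {}).get("latitude"),
        "longitude": place.get("gps_coordinates", {}).get("longitude"),
    })

df = pd.DataFrame(rows)
print(df.head())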
Check out the documentation for more details.
Disclaimer: I work at SerpApi.

google search web scraping class= not same as on browser

I am trying to grab the video panel in a Google result.
For example, I am searching for "great+castles", and that search result includes a panel that contains videos.
When I scrape it, I get HTML but with different attribute values, so I am not able to grab the video panel.
text="great+castles"
url = f'https://google.com/search?q={text}'
response = requests.get(url)
print(url)
soup = BeautifulSoup(response.text,'html.parser')
a=soup.findAll('div',{'id':'main'})
a
I do get an output response, but the attributes are not the same as in Google Chrome.
Firstly, you can always write that HTML response to an HTML file and check what you're actually getting by opening it in the browser.
Secondly, you cannot scrape data from Google that easily; you need proxies for that, and even with elite proxies you may face a number of challenges like reCAPTCHA, etc.
You have 2 options to check the source code returned by requests:
Save the response as an HTML file locally and open it in a browser (a minimal sketch of this option follows below).
Get the Scrapy framework and use view(response) from the Scrapy shell. The Scrapy option is handy but requires installing the framework, which can be overkill for a one-time project.
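A minimal sketch of the first option (the filename is arbitrary):

from pathlib import Path
import webbrowser
import requests

response = requests.get("https://google.com/search?q=great+castles")

# save exactly what requests received, then inspect it in a normal browser
out = Path("google_result.html")
out.write_text(response.text, encoding="utf-8")
webbrowser.open(out.resolve().as_uri())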
There is also another (more robust) way to get results from Google Search: the Google Search API from SerpApi. It's a paid API with a free plan.
For example, your request will return handy json including the inline video section:
> "inline_videos":
> [
> {
> "position":
> 1,
> "title":
> "A Thousand Years of European Castles",
> "link":
> "https://www.youtube.com/watch?v=uXSFt-zey84",
> "thumbnail":
> "https://i.ytimg.com/vi/uXSFt-zey84/mqdefault.jpg?sqp=-oaymwEECHwQRg&rs=AMzJL3n1trdIa7_n5X-kJf8pq70OYoY47w",
> "channel":
> "Best Documentary",
> "duration":
> "53:59",
> "platform":
> "YouTube",
> "date":
> "Jan 25, 2022"
> },
> {
> "position":
> 2,
> "title":
> "The Most Beautiful Castles in the World",
> "link":
> "https://www.youtube.com/watch?v=ln-v2ibnWHU",
> "thumbnail":
> "https://i.ytimg.com/vi/ln-v2ibnWHU/mqdefault.jpg?sqp=-oaymwEECHwQRg&rs=AMzJL3kHM2n3_vkRLM_stMr0XuiFs5uaCQ",
> "channel":
> "Luxury Homes",
> "duration":
> "4:58",
> "platform":
> "YouTube",
> "date":
> "Mar 29, 2020"
> },
> {
> "position":
> 3,
> "title":
> "Great Castles of Europe: Neuschwanstein (Part 1 of 3)",
> "link":
> "https://www.youtube.com/watch?v=R_uFzANW2Xo",
> "thumbnail":
> "https://i.ytimg.com/vi/R_uFzANW2Xo/mqdefault.jpg?sqp=-oaymwEECHwQRg&rs=AMzJL3nYdSY5YW2QU1pijXo3xx7ObrILdg",
> "channel":
> "trakehnen",
> "duration":
> "8:51",
> "platform":
> "YouTube",
> "date":
> "Sep 24, 2009"
> }
> ],
Disclaimer, I work for SerpApi.
You can scrape Google Search video panel results using the BeautifulSoup web scraping library.
To get to the tab we need, you have to set it in the request parameters, like this:
# these URL params are taken from the actual Google search URL
# and transformed to a more readable format
params = {
    "q": "great castles",  # query
    "tbm": "vid",          # video panel
    "gl": "us",            # country of the search
    "hl": "en"             # language of the search
}
To get the required data, you need a "container": a class selector that holds all the information about a video result, i.e. the title, link, channel name and so on.
In our case this is the "video-voyager" selector, which contains data about the title, channel name, video link, description and so on.
Have a look at the SelectorGadget Chrome extension to easily pick selectors by clicking on the desired element in your browser (it doesn't always work perfectly if the website is rendered via JavaScript).
Check code in online IDE.
from bs4 import BeautifulSoup
import requests, lxml, json
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36"
}
params = {
    "q": "great castles",  # query
    "tbm": "vid",          # video panel
    "gl": "us",            # country of the search
    "hl": "en"             # language of the search
}

# by default it will scrape video page results, but that can be turned off
def scrape_google_videos(inline_videos=False, video_page=True):
    if inline_videos:
        data_inline_video = []
        params.pop("tbm", None)  # deletes tbm: vid

        html = requests.get("https://www.google.com/search", headers=headers, params=params, timeout=30)
        soup = BeautifulSoup(html.text, "lxml")

        print("Inline video data:\n")
        for result in soup.select(".WZIVy"):
            title = result.select_one(".cHaqb").text
            platform = result.select_one("cite").text
            chanel = result.select_one(".pcJO7e span").text.replace(" · ", "")
            date = result.select_one(".hMJ0yc span").text

            data_inline_video.append({
                "title": title,
                "platform": platform,
                "chanel": chanel,
                "date": date
            })
        print(json.dumps(data_inline_video, indent=2, ensure_ascii=False))

    if video_page:
        data_video_panel = []

        html = requests.get("https://www.google.com/search", headers=headers, params=params, timeout=30)
        soup = BeautifulSoup(html.text, "lxml")

        print("Video panel data:\n")
        for products in soup.select("video-voyager"):
            title = products.select_one(".DKV0Md").text
            description = products.select_one(".Uroaid").text
            link = products.select_one(".ct3b9e a")["href"]
            chanel = products.select_one(".Zg1NU+ span").text
            duration = products.select_one(".J1mWY div").text
            date = products.select_one(".P7xzyf span span").text

            data_video_panel.append({
                "title": title,
                "description": description,
                "link": link,
                "chanel": chanel,
                "duration": duration,
                "date": date
            })
        print(json.dumps(data_video_panel, indent=2, ensure_ascii=False))

scrape_google_videos(video_page=True, inline_videos=False)
Inline video data:
[
{
"title": "A Thousand Years of European Castles",
"platform": "YouTube",
"chanel": "Best Documentary",
"date": "Jan 25, 2022"
},
{
"title": "MOST BEAUTIFUL Castles on Earth",
"platform": "YouTube",
"chanel": "Top Fives",
"date": "Feb 2, 2022"
},
{
"title": "Great Castles of Europe: Neuschwanstein (Part 1 of 3)",
"platform": "YouTube",
"chanel": "trakehnen",
"date": "Sep 24, 2009"
}
]

Get a list of elements into separated arrays

Hello fellow developers out there,
I'm new to Python and I need to write a web scraper to grab info from Google Scholar.
I ended up coding this function to get values using XPath:
thread = browser.find_elements(By.XPATH, (" %s" % exp))
xArray = []
for t in thread:
    if not atr:
        xThread = t.text
    else:
        xThread = t.get_attribute('href')
    xArray.append(xThread)
return xArray
I don't know if it's a good or a bad solution, so I humbly accept any suggestions to make it work better.
Anyway, my actual problem is that I am getting all the author names from the page I am scraping, and what I really need are the names grouped by result.
When I print the results, I wish I could get something like this:
[[author1, author2, author3], [author4, author5, author6]]
What I am getting right now is:
[author1, author3, author4, author5, author6]
The structure is as follows:
<div class="gs_a">
LR Hisch,
AM Gobin
,AR Lowery,
F Tam
... -Annals of biomedical ...,2006 - Springer
</div>
And the same structure is repeated all over the page for different documents and authors.
And this is the call to the function I explained earlier:
authors = (clothoSpins(".//*[@class='gs_a']//a"))
Which gets me the entire list of authors.
Here is the logic (I used Selenium in the code below, but update it as per your needs).
Logic:
url = "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C21&q=python&btnG="
driver.get(url)
# get the authors and add to list
listBooks = []
books = driver.find_elements_by_xpath("//div[#class='gs_a']")
for bookNum in books:
auths = []
authors = driver.find_elements_by_xpath("(//div[#class='gs_a'])[%s]/a|(//div[#class='gs_a'])[%s]/self::*[not(a)]"%(bookNum+1,bookNum+1))
for author in authors:
auths.append(author.text)
listBooks.append(auths)
Output:
[['F Pedregosa', 'G Varoquaux', 'A Gramfort'], ['PD Adams', 'PV Afonine'], ['TE Oliphant'], ['JW Peirce'], ['S Anders', 'PT Pyl', 'W Huber'], ['MF Sanner'], ['S Bird', 'E Klein'], ['M Lutz - 2001 - books.google.com'], ['G Rossum - 1995 - dl.acm.org'], ['W McKinney - … of the 9th Python in Science Conference, 2010 - pdfs.semanticscholar.org']]
To group by result, you can create an empty list, iterate over the results, and append the extracted data to the list as a dict; the returned result can then be serialized to a JSON string using the json.dumps() method, e.g.:
temp_list = []
for result in results:
    # extracting title, link, etc.
    temp_list.append({
        "title": title,
        # other extracted elements
    })
print(json.dumps(temp_list, indent=2))
"""
Returned results is a list of dictionaries:
[
{
"title": "A new biology for a new century",
# other extracted elements..
}
]
"""
Code and full example in the online IDE:
from parsel import Selector
import requests, json, re
# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
"q": "biology", # search query
"hl": "en" # language
}
# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36",
}
html = requests.get("https://scholar.google.com/scholar", params=params, headers=headers, timeout=30)
selector = Selector(html.text)
data = []
for result in selector.css(".gs_ri"):
    # xpath("normalize-space()") to get blank text nodes as well, for the full string output
    title = result.css(".gs_rt a").xpath("normalize-space()").get()
    # https://regex101.com/r/7bmx8h/1
    authors = re.search(r"^(.*?)-", result.css(".gs_a").xpath("normalize-space()").get()).group(1).strip()
    snippet = result.css(".gs_rs").xpath("normalize-space()").get()
    # https://regex101.com/r/47erNR/1
    year = re.search(r"\d+", result.css(".gs_a").xpath("normalize-space()").get()).group(0)
    # https://regex101.com/r/13468d/1
    publisher = re.search(r"\d+\s?-\s?(.*)", result.css(".gs_a").xpath("normalize-space()").get()).group(1)
    cited_by = int(re.search(r"\d+", result.css(".gs_or_btn.gs_nph+ a::text").get()).group(0))

    data.append({
        "title": title,
        "snippet": snippet,
        "authors": authors,
        "year": year,
        "publisher": publisher,
        "cited_by": cited_by
    })
print(json.dumps(data, indent=2, ensure_ascii=False))
Output:
[
{
"title": "A new biology for a new century",
"snippet": "… A society that permits biology to become an engineering discipline, that allows that science … science of biology that helps us to do this, shows the way. An engineering biology might still …",
"authors": "CR Woese",
"year": "2004",
"publisher": "Am Soc Microbiol",
"cited_by": 743
}, ... other results
{
"title": "Campbell biology",
"snippet": "… Now, Campbell series Biology texts are institutionalized. This is the standard biology text across colleges in the US To say the authors and editors know what they are doing at this point …",
"authors": "JB Reece, LA Urry, ML Cain, SA Wasserman…",
"year": "2014",
"publisher": "fvsuol4ed.org",
"cited_by": 1184
}
]
Note: in the example above I'm using the parsel library, which is very similar to beautifulsoup and selenium in terms of data extraction.
Alternatively, you can achieve the same thing by using the Google Scholar Organic Results API from SerpApi. It's a paid API with a free plan.
The difference is that you don't have to create the parser from scratch, maintain it, or figure out how to scale it without getting blocked.
Example code to integrate:
from serpapi import GoogleSearch
import os, json
params = {
"api_key": os.getenv("API_KEY"), # SerpApi API key
"engine": "google_scholar", # parsing engine
"q": "biology", # search query
"hl": "en" # language
}
search = GoogleSearch(params) # where data extraction happens
results = search.get_dict() # JSON -> Python dictionary
for result in results["organic_results"]:
    print(json.dumps(result, indent=2))
Output:
{
"position": 0,
"title": "A new biology for a new century",
"result_id": "KNJ0p4CbwgoJ",
"link": "https://journals.asm.org/doi/abs/10.1128/MMBR.68.2.173-186.2004",
"snippet": "\u2026 A society that permits biology to become an engineering discipline, that allows that science \u2026 science of biology that helps us to do this, shows the way. An engineering biology might still \u2026",
"publication_info": {
"summary": "CR Woese - Microbiology and molecular biology reviews, 2004 - Am Soc Microbiol"
},
"resources": [
{
"title": "nih.gov",
"file_format": "HTML",
"link": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC419918/"
},
{
"title": "View it # CTU",
"link": "https://scholar.google.com/scholar?output=instlink&q=info:KNJ0p4CbwgoJ:scholar.google.com/&hl=en&as_sdt=0,11&scillfp=15047057806408271473&oi=lle"
}
],
"inline_links": {
"serpapi_cite_link": "https://serpapi.com/search.json?engine=google_scholar_cite&q=KNJ0p4CbwgoJ",
"html_version": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC419918/",
"cited_by": {
"total": 743,
"link": "https://scholar.google.com/scholar?cites=775353062728716840&as_sdt=80005&sciodt=0,11&hl=en",
"cites_id": "775353062728716840",
"serpapi_scholar_link": "https://serpapi.com/search.json?as_sdt=80005&cites=775353062728716840&engine=google_scholar&hl=en"
},
"related_pages_link": "https://scholar.google.com/scholar?q=related:KNJ0p4CbwgoJ:scholar.google.com/&scioq=biology&hl=en&as_sdt=0,11",
"versions": {
"total": 20,
"link": "https://scholar.google.com/scholar?cluster=775353062728716840&hl=en&as_sdt=0,11",
"cluster_id": "775353062728716840",
"serpapi_scholar_link": "https://serpapi.com/search.json?as_sdt=0%2C11&cluster=775353062728716840&engine=google_scholar&hl=en"
}
}
}
{
"position": 9,
"title": "Campbell biology",
"result_id": "YnWp49O_RTMJ",
"type": "Book",
"link": "http://www.fvsuol4ed.org/reviews/Biology%20Organismal%20Template_Campbell%20Biology_Moran.pdf",
"snippet": "\u2026 Now, Campbell series Biology texts are institutionalized. This is the standard biology text across colleges in the US To say the authors and editors know what they are doing at this point \u2026",
"publication_info": {
"summary": "JB Reece, LA Urry, ML Cain, SA Wasserman\u2026 - 2014 - fvsuol4ed.org"
},
"resources": [
{
"title": "fvsuol4ed.org",
"file_format": "PDF",
"link": "http://www.fvsuol4ed.org/reviews/Biology%20Organismal%20Template_Campbell%20Biology_Moran.pdf"
}
],
"inline_links": {
"serpapi_cite_link": "https://serpapi.com/search.json?engine=google_scholar_cite&q=YnWp49O_RTMJ",
"cited_by": {
"total": 1184,
"link": "https://scholar.google.com/scholar?cites=3694569986105898338&as_sdt=80005&sciodt=0,11&hl=en",
"cites_id": "3694569986105898338",
"serpapi_scholar_link": "https://serpapi.com/search.json?as_sdt=80005&cites=3694569986105898338&engine=google_scholar&hl=en"
},
"related_pages_link": "https://scholar.google.com/scholar?q=related:YnWp49O_RTMJ:scholar.google.com/&scioq=biology&hl=en&as_sdt=0,11",
"versions": {
"total": 33,
"link": "https://scholar.google.com/scholar?cluster=3694569986105898338&hl=en&as_sdt=0,11",
"cluster_id": "3694569986105898338",
"serpapi_scholar_link": "https://serpapi.com/search.json?as_sdt=0%2C11&cluster=3694569986105898338&engine=google_scholar&hl=en"
},
"cached_page_link": "http://scholar.googleusercontent.com/scholar?q=cache:YnWp49O_RTMJ:scholar.google.com/+biology&hl=en&as_sdt=0,11"
}
}
If you need to parse data from all Google Scholar organic results, there's a dedicated blog post of mine at SerpApi, "Scrape historic 2017-2021 Organic, Cite Google Scholar results to CSV, SQLite", that shows how to do it with the API.
Disclaimer, I work for SerpApi.
