Hello fellow developers,
I'm new to Python and I need to write a web scraper to collect info from Google Scholar.
I ended up coding this function to get values using XPath:
def clothoSpins(exp, atr=False):
    thread = browser.find_elements(By.XPATH, exp)
    xArray = []
    for t in thread:
        if not atr:
            xThread = t.text
        else:
            xThread = t.get_attribute('href')
        xArray.append(xThread)
    return xArray
I don't know whether it's a good or a bad solution, so I humbly accept any suggestions to make it work better.
Anyway, my actual problem is that I am getting all author names from the page I am scraping, but what I really need is the names grouped by result.
When I print the results, I would like to get something like this:
[[author1, author2, author3], [author4, author5, author6]]
What I am getting right now is:
[author1, author3, author4, author5, author6]
The structure is as follows:
<div class="gs_a">
LR Hisch,
AM Gobin
,AR Lowery,
F Tam
... -Annals of biomedical ...,2006 - Springer
</div>
And the same structure is repeated all over the page for different documents and authors.
And this is the call to the function I explained earlier:
authors = clothoSpins(".//*[@class='gs_a']//a")
Which gets me the entire list of authors.
Here is the logic (I used Selenium in the code below, but adapt it as needed).
Logic:
url = "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C21&q=python&btnG="
driver.get(url)
# get the authors and add to list
listBooks = []
books = driver.find_elements_by_xpath("//div[@class='gs_a']")
for bookNum in range(len(books)):
    auths = []
    authors = driver.find_elements_by_xpath(
        "(//div[@class='gs_a'])[%s]/a|(//div[@class='gs_a'])[%s]/self::*[not(a)]" % (bookNum + 1, bookNum + 1)
    )
    for author in authors:
        auths.append(author.text)
    listBooks.append(auths)
Output:
[['F Pedregosa', 'G Varoquaux', 'A Gramfort'], ['PD Adams', 'PV Afonine'], ['TE Oliphant'], ['JW Peirce'], ['S Anders', 'PT Pyl', 'W Huber'], ['MF Sanner'], ['S Bird', 'E Klein'], ['M Lutz - 2001 - books.google.com'], ['G Rossum - 1995 - dl.acm.org'], ['W McKinney - … of the 9th Python in Science Conference, 2010 - pdfs.semanticscholar.org']]
To group by result you can create an empty list, iterate over the results, and append the extracted data to the list as a dict; the returned result can then be serialized to a JSON string using the json.dumps() method, e.g.:
temp_list = []
for result in results:
    # extracting title, link, etc.
    temp_list.append({
        "title": title,
        # other extracted elements
    })

print(json.dumps(temp_list, indent=2))
"""
Returned results is a list of dictionaries:
[
{
"title": "A new biology for a new century",
# other extracted elements..
}
]
"""
Code and full example in the online IDE:
from parsel import Selector
import requests, json, re

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "biology",  # search query
    "hl": "en"       # language
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36",
}

html = requests.get("https://scholar.google.com/scholar", params=params, headers=headers, timeout=30)
selector = Selector(html.text)

data = []

for result in selector.css(".gs_ri"):
    # xpath("normalize-space()") also picks up blank text nodes, so we get the full string output
    title = result.css(".gs_rt a").xpath("normalize-space()").get()
    # https://regex101.com/r/7bmx8h/1
    authors = re.search(r"^(.*?)-", result.css(".gs_a").xpath("normalize-space()").get()).group(1).strip()
    snippet = result.css(".gs_rs").xpath("normalize-space()").get()
    # https://regex101.com/r/47erNR/1
    year = re.search(r"\d+", result.css(".gs_a").xpath("normalize-space()").get()).group(0)
    # https://regex101.com/r/13468d/1
    publisher = re.search(r"\d+\s?-\s?(.*)", result.css(".gs_a").xpath("normalize-space()").get()).group(1)
    cited_by = int(re.search(r"\d+", result.css(".gs_or_btn.gs_nph+ a::text").get()).group(0))

    data.append({
        "title": title,
        "snippet": snippet,
        "authors": authors,
        "year": year,
        "publisher": publisher,
        "cited_by": cited_by
    })

print(json.dumps(data, indent=2, ensure_ascii=False))
Output:
[
{
"title": "A new biology for a new century",
"snippet": "… A society that permits biology to become an engineering discipline, that allows that science … science of biology that helps us to do this, shows the way. An engineering biology might still …",
"authors": "CR Woese",
"year": "2004",
"publisher": "Am Soc Microbiol",
"cited_by": 743
}, ... other results
{
"title": "Campbell biology",
"snippet": "… Now, Campbell series Biology texts are institutionalized. This is the standard biology text across colleges in the US To say the authors and editors know what they are doing at this point …",
"authors": "JB Reece, LA Urry, ML Cain, SA Wasserman…",
"year": "2014",
"publisher": "fvsuol4ed.org",
"cited_by": 1184
}
]
Note: in the example above, I'm using the parsel library, which is very similar to BeautifulSoup and Selenium in terms of data extraction.
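To get a rough feel for how similar they are, here is a tiny side-by-side sketch (the HTML snippet and selectors are just illustrative, not taken from a real Scholar page):
from parsel import Selector
from bs4 import BeautifulSoup

html_text = "<div class='gs_ri'><h3 class='gs_rt'><a>Some title</a></h3></div>"

# parsel: chain CSS/XPath selectors on a Selector object
print(Selector(text=html_text).css(".gs_rt a::text").get())                # Some title

# BeautifulSoup: CSS selectors via select_one(), text via get_text()
print(BeautifulSoup(html_text, "lxml").select_one(".gs_rt a").get_text())  # Some title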
Alternatively, you can achieve the same thing by using the Google Scholar Organic Results API from SerpApi. It's a paid API with a free plan.
The difference is that you don't have to create the parser from scratch, maintain it, or figure out how to scale it without getting blocked.
Example code to integrate:
from serpapi import GoogleSearch
import os, json

params = {
    "api_key": os.getenv("API_KEY"),  # SerpApi API key
    "engine": "google_scholar",       # parsing engine
    "q": "biology",                   # search query
    "hl": "en"                        # language
}

search = GoogleSearch(params)  # where data extraction happens
results = search.get_dict()    # JSON -> Python dictionary

for result in results["organic_results"]:
    print(json.dumps(result, indent=2))
Output:
{
"position": 0,
"title": "A new biology for a new century",
"result_id": "KNJ0p4CbwgoJ",
"link": "https://journals.asm.org/doi/abs/10.1128/MMBR.68.2.173-186.2004",
"snippet": "\u2026 A society that permits biology to become an engineering discipline, that allows that science \u2026 science of biology that helps us to do this, shows the way. An engineering biology might still \u2026",
"publication_info": {
"summary": "CR Woese - Microbiology and molecular biology reviews, 2004 - Am Soc Microbiol"
},
"resources": [
{
"title": "nih.gov",
"file_format": "HTML",
"link": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC419918/"
},
{
"title": "View it # CTU",
"link": "https://scholar.google.com/scholar?output=instlink&q=info:KNJ0p4CbwgoJ:scholar.google.com/&hl=en&as_sdt=0,11&scillfp=15047057806408271473&oi=lle"
}
],
"inline_links": {
"serpapi_cite_link": "https://serpapi.com/search.json?engine=google_scholar_cite&q=KNJ0p4CbwgoJ",
"html_version": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC419918/",
"cited_by": {
"total": 743,
"link": "https://scholar.google.com/scholar?cites=775353062728716840&as_sdt=80005&sciodt=0,11&hl=en",
"cites_id": "775353062728716840",
"serpapi_scholar_link": "https://serpapi.com/search.json?as_sdt=80005&cites=775353062728716840&engine=google_scholar&hl=en"
},
"related_pages_link": "https://scholar.google.com/scholar?q=related:KNJ0p4CbwgoJ:scholar.google.com/&scioq=biology&hl=en&as_sdt=0,11",
"versions": {
"total": 20,
"link": "https://scholar.google.com/scholar?cluster=775353062728716840&hl=en&as_sdt=0,11",
"cluster_id": "775353062728716840",
"serpapi_scholar_link": "https://serpapi.com/search.json?as_sdt=0%2C11&cluster=775353062728716840&engine=google_scholar&hl=en"
}
}
}
{
"position": 9,
"title": "Campbell biology",
"result_id": "YnWp49O_RTMJ",
"type": "Book",
"link": "http://www.fvsuol4ed.org/reviews/Biology%20Organismal%20Template_Campbell%20Biology_Moran.pdf",
"snippet": "\u2026 Now, Campbell series Biology texts are institutionalized. This is the standard biology text across colleges in the US To say the authors and editors know what they are doing at this point \u2026",
"publication_info": {
"summary": "JB Reece, LA Urry, ML Cain, SA Wasserman\u2026 - 2014 - fvsuol4ed.org"
},
"resources": [
{
"title": "fvsuol4ed.org",
"file_format": "PDF",
"link": "http://www.fvsuol4ed.org/reviews/Biology%20Organismal%20Template_Campbell%20Biology_Moran.pdf"
}
],
"inline_links": {
"serpapi_cite_link": "https://serpapi.com/search.json?engine=google_scholar_cite&q=YnWp49O_RTMJ",
"cited_by": {
"total": 1184,
"link": "https://scholar.google.com/scholar?cites=3694569986105898338&as_sdt=80005&sciodt=0,11&hl=en",
"cites_id": "3694569986105898338",
"serpapi_scholar_link": "https://serpapi.com/search.json?as_sdt=80005&cites=3694569986105898338&engine=google_scholar&hl=en"
},
"related_pages_link": "https://scholar.google.com/scholar?q=related:YnWp49O_RTMJ:scholar.google.com/&scioq=biology&hl=en&as_sdt=0,11",
"versions": {
"total": 33,
"link": "https://scholar.google.com/scholar?cluster=3694569986105898338&hl=en&as_sdt=0,11",
"cluster_id": "3694569986105898338",
"serpapi_scholar_link": "https://serpapi.com/search.json?as_sdt=0%2C11&cluster=3694569986105898338&engine=google_scholar&hl=en"
},
"cached_page_link": "http://scholar.googleusercontent.com/scholar?q=cache:YnWp49O_RTMJ:scholar.google.com/+biology&hl=en&as_sdt=0,11"
}
}
If you need to parse data from all Google Scholar organic results, there's a dedicated blog post of mine at SerpApi, Scrape historic 2017-2021 Organic, Cite Google Scholar results to CSV, SQLite, that shows how to do it with the API.
Disclaimer: I work for SerpApi.
I'm trying to get data from Google Maps with Python and BeautifulSoup, for example pharmacies in a city. I want to get the location data (lat-lon), the name of the pharmacy (e.g., MDC Pharmacy), the score of the pharmacy (3.2), the number of reviews (10), the address with zip code, and the phone number of the pharmacy.
I have tried Python and BeautifulSoup but I'm stuck because I don't know how to extract the data. Selecting by class isn't working. When I prettify and print the results I can see all of the data, so how can I clean it up for a pandas DataFrame? I need more code both to clean the data and to add it to a list or df. Also the class object is turning into NoneType. Here is my code:
import requests
from bs4 import BeautifulSoup
r=requests.get("https://www.google.com.tr/maps/search/eczane/#37.4809437,36.7749346,57378m/data=!3m1!1e3")
soup= BeautifulSoup(r.content,"lxml")
a=soup.prettify()
l=soup.find("div",{"class":"mapsConsumerUiSubviewSectionGm2Placeresultcontainer__result-container mapsConsumerUiSubviewSectionGm2Placeresultcontainer__one-action mapsConsumerUiSubviewSectionGm2Placeresultcontainer__wide-margin"})
print(a)
(Screenshot of the printed output: Printresult.jpg)
I have this result (above) and I need to extract the data from it.
I want a result like this table (below). Thanks...
(Sample of the wanted result)
You don't need selenium for this. You don't even need BeautifulSoup (in fact, it doesn't help at all). Here is code that fetches the page, isolates the initialization data JSON, decodes it, and prints the resulting Python structure.
You would need to print out the structure, and start doing some counting to find the data you want, but it's all here.
import requests
import json
from pprint import pprint
r=requests.get("https://www.google.com.tr/maps/search/eczane/#37.4809437,36.7749346,57378m/data=!3m1!1e3")
txt = r.text
find1 = "window.APP_INITIALIZATION_STATE="
find2 = ";window.APP"
i1 = txt.find(find1)
i2 = txt.find(find2, i1+1 )
js = txt[i1+len(find1):i2]
data = json.loads(js)
pprint(data)
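For example, pprint's depth parameter lets you look at the structure one layer at a time before committing to specific indices (a small sketch; the exact indices where the pharmacy data lives are not shown here and have to be found by inspection):
from pprint import pprint

# show only the top few nesting levels; anything deeper is elided as "..."
pprint(data, depth=3)

# once you have spotted the branch you need, drill in step by step,
# e.g. (hypothetical index): pprint(data[3], depth=2)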
It might also be worth looking into a third-party solution like SerpApi. It's a paid API with a free trial.
Example python code (available in other libraries also):
from serpapi import GoogleSearch

params = {
    "api_key": "secret_api_key",
    "engine": "google_maps",
    "q": "eczane",
    "google_domain": "google.com",
    "hl": "en",
    "ll": "@37.5393407,36.707705,11z",
    "type": "search"
}

search = GoogleSearch(params)
results = search.get_dict()
Example JSON output:
"local_results": [
{
"position": 1,
"title": "Ocak Eczanesi",
"place_id": "ChIJcRipbonnLRUR4DG-UuCnB2I",
"data_id": "0x152de7896ea91871:0x6207a7e052be31e0",
"data_cid": "7063799122456621536",
"reviews_link": "https://serpapi.com/search.json?data_id=0x152de7896ea91871%3A0x6207a7e052be31e0&engine=google_maps_reviews&hl=en",
"photos_link": "https://serpapi.com/search.json?data_id=0x152de7896ea91871%3A0x6207a7e052be31e0&engine=google_maps_photos&hl=en",
"gps_coordinates": {
"latitude": 37.5775156,
"longitude": 36.957789399999996
},
"place_id_search": "https://serpapi.com/search.json?data=%214m5%213m4%211s0x152de7896ea91871%3A0x6207a7e052be31e0%218m2%213d37.5775156%214d36.957789399999996&engine=google_maps&google_domain=google.com&hl=en&type=place",
"rating": 3.5,
"reviews": 8,
"type": "Drug store",
"address": "Kanuni Mh. Milcan Cd. Pk:46100 Merkez, 46100 Dulkadiroğlu/Kahramanmaraş, Turkey",
"open_state": "Closes soon ⋅ 6PM ⋅ Opens 8:30AM Fri",
"hours": "Closing soon: 6:00 PM",
"phone": "+90 344 231 68 00",
"website": "https://kahramanmaras.bel.tr/nobetci-eczaneler",
"thumbnail": "https://lh5.googleusercontent.com/p/AF1QipN5CQRdoKc_BdCgSDiEdi0nEkk1X_VUy1PP4wN3=w93-h92-k-no"
},
{
"position": 2,
"title": "Nobetci eczane",
"place_id": "ChIJP4eh2WndLRURD6IcnOov0dA",
"data_id": "0x152ddd69d9a1873f:0xd0d12fea9c1ca20f",
"data_cid": "15046860514709512719",
"reviews_link": "https://serpapi.com/search.json?data_id=0x152ddd69d9a1873f%3A0xd0d12fea9c1ca20f&engine=google_maps_reviews&hl=en",
"photos_link": "https://serpapi.com/search.json?data_id=0x152ddd69d9a1873f%3A0xd0d12fea9c1ca20f&engine=google_maps_photos&hl=en",
"gps_coordinates": {
"latitude": 37.591462,
"longitude": 36.8847051
},
"place_id_search": "https://serpapi.com/search.json?data=%214m5%213m4%211s0x152ddd69d9a1873f%3A0xd0d12fea9c1ca20f%218m2%213d37.591462%214d36.8847051&engine=google_maps&google_domain=google.com&hl=en&type=place",
"rating": 3.3,
"reviews": 12,
"type": "Pharmacy",
"address": "Mimar Sinan, 48007. Sk. No:19, 46050 Kahramanmaraş Merkez/Kahramanmaraş, Turkey",
"open_state": "Open now",
"thumbnail": "https://lh5.googleusercontent.com/p/AF1QipNznf-hC_y9KdijwUMqdO9YIcn7rbN8ZQpdIHK5=w163-h92-k-no"
},
...
]
Check out the documentation for more details.
Disclaimer: I work at SerpApi.
I'm scraping data from the following API: https://content.osu.edu/v2/classes/search?q=&campus=col&academic-career=ugrd
The JSON format looks like:
{
"data":{
"totalItems":10000,
"currentItemCount":200,
"page":1,
"totalPages":50,
"refineQueryTemplate":"q=_QUERY_&campus=col&academic-career=ugrd&p=1",
"nextPageLink":"?q=&campus=col&academic-career=ugrd&p=2",
"prevPageLink":null,
"activeSort":"",
"courses":[
{
"course":{
"term":"Summer 2021",
"effectiveDate":"2019-01-06",
"effectiveStatus":"A",
"title":"Dental Hygiene Practicum",
"shortDescription":"DHY Practicum",
"description":"Supervised practice outside the traditional clinic in a setting similar to one in which the dental hygiene student may practice, teach, or conduct research upon graduation.\nPrereq: Sr standing in DHY or BDCP major. Repeatable to a maximum of 4 cr hrs or 4 completions. This course is graded S/U.",
"equivalentId":"S1989",
"allowMultiEnroll":"N",
"maxUnits":4,
"minUnits":1,
"repeatUnitsLimit":4,
"grading":"Satisfactory/Unsatisfactory",
"component":"Field Experience",
"primaryComponent":"FLD",
"offeringNumber":"1",
"academicGroup":"Dentistry",
"subject":"DENTHYG",
"catalogNumber":"4430",
"campus":"Columbus",
"academicOrg":"D2120",
"academicCareer":"Undergraduate",
"cipCode":"51.0602",
"courseAttributes":[
{
"name":"CCP",
"value":"NON-CCP",
"description":"Not eligible for College Credit Plus program"
}
],
"campusCode":"COL",
"catalogLevel":"4xxx",
"subjectDesc":"Dental Hygiene",
"courseId":"152909"
},
"sections":[
{
"classNumber":"20850",
"section":"10",
"component":"Field Experience",
"instructionMode":"Distance Learning",
"meetings":[
{
"meetingNumber":1,
"facilityId":null,
"facilityType":null,
"facilityDescription":null,
"facilityDescriptionShort":null,
"facilityGroup":null,
"facilityCapacity":0,
"buildingCode":null,
"room":null,
"buildingDescription":null,
"buildingDescriptionShort":null,
"startTime":null,
"endTime":null,
"startDate":"2021-05-12",
"endDate":"2021-07-30",
"monday":false,
"tuesday":false,
"wednesday":false,
"thursday":false,
"friday":false,
"saturday":false,
"sunday":false,
"standingMeetingPattern":null,
"instructors":[
{
"displayName":"Irina A Novopoltseva",
"role":"PI",
"email":"novopoltseva.1#osu.edu"
}
]
}
],
"courseOfferingNumber":1,
"courseId":"152909",
"academicGroup":"DEN",
"subject":"Dental Hygiene",
"catalogNumber":"4430",
"career":"UGRD",
"description":"DHY Practicum",
"enrollmentStatus":"Open",
"status":"A",
"type":"E",
"associatedClass":"10",
"autoEnrollWaitlist":true,
"autoEnrollSection1":null,
"autoEnrollSection2":null,
"consent":"D",
"waitlistCapacity":5,
"minimumEnrollment":0,
"enrollmentTotal":1,
"waitlistTotal":0,
"academicOrg":"D2120",
"location":"CS-COLMBUS",
"equivalentCourseId":null,
"startDate":"2021-05-12",
"endDate":"2021-07-30",
"cancelDate":null,
"primaryInstructorSection":"10",
"combinedSection":null,
"holidaySchedule":"OSUSIS",
"sessionCode":"1S",
"sessionDescription":"Summer Term",
"term":"Summer 2021",
"campus":"Columbus",
"attributes":[
{
"name":"CCP",
"value":"NON-CCP",
"description":"Not eligible for College Credit Plus program"
}
],
"secCampus":"COL",
"secAcademicGroup":"DEN",
"secCatalogNumber":"4430",
"meetingDays":"",
"_parent":"152909-1-1214",
"subjectDesc":"Dental Hygiene",
"courseTitle":"Dental Hygiene Practicum",
"courseDescription":"Supervised practice outside the traditional clinic in a setting similar to one in which the dental hygiene student may practice, teach, or conduct research upon graduation.\nPrereq: Sr standing in DHY or BDCP major. Repeatable to a maximum of 4 cr hrs or 4 completions. This course is graded S/U.",
"catalogLevel":"4xxx",
"termCode":"1214"
}
]
},
{
"course":{
"term":"Spring 2021",
"effectiveDate":"2020-08-24",
"effectiveStatus":"A",
"title":"Undergraduate Research in Public Health",
"shortDescription":"Res Pub Hlth",
"description":"Undergraduate research under the guidance of a faculty mentor in a basic or applied area of public health.\nPrereq: Jr or Sr standing, and enrollment in BSPH major, and permission of advisor. Students who are not junior or senior standing may be eligible with faculty mentor approval. Repeatable to a maximum of 6 cr hrs. This course is graded S/U.",
"equivalentId":"",
"allowMultiEnroll":"N",
"maxUnits":6,
"minUnits":1,
"repeatUnitsLimit":6,
"grading":"Satisfactory/Unsatisfactory",
"component":"Independent Study",
"primaryComponent":"IND",
"offeringNumber":"1",
"subject":"PUBHLTH",
"catalogNumber":"4998",
"campus":"Columbus",
"academicOrg":"D2505",
"academicCareer":"Undergraduate",
"cipCode":"51.2201",
"courseAttributes":[
{
"name":"CCP",
"value":"NON-CCP",
"description":"Not eligible for College Credit Plus program"
}
],
"campusCode":"COL",
"catalogLevel":"4xxx",
"subjectDesc":"Public Health",
"courseId":"160532"
},
"sections":[
{
"classNumber":"3557",
"section":"0030",
"component":"Independent Study",
"instructionMode":"In Person",
"meetings":[
{
"meetingNumber":1,
"facilityId":null,
"facilityType":null,
"facilityDescription":null,
"facilityDescriptionShort":null,
"facilityGroup":null,
"facilityCapacity":0,
"buildingCode":null,
"room":null,
"buildingDescription":null,
"buildingDescriptionShort":null,
"startTime":null,
"endTime":null,
"startDate":"2021-01-11",
"endDate":"2021-04-23",
"monday":false,
"tuesday":false,
"wednesday":false,
"thursday":false,
"friday":false,
"saturday":false,
"sunday":false,
"standingMeetingPattern":null,
"instructors":[
{
"displayName":"Abigail Norris Turner",
"role":"PI",
"email":"norris-turner.1#osu.edu"
}
]
}
],
"courseOfferingNumber":1,
"courseId":"160532",
"academicGroup":"PBH",
"subject":"Public Health",
"catalogNumber":"4998",
"career":"UGRD",
"description":"Res Pub Hlth",
"enrollmentStatus":"Open",
"status":"A",
"type":"E",
"associatedClass":"1",
"autoEnrollWaitlist":true,
"autoEnrollSection1":null,
"autoEnrollSection2":null,
"consent":"I",
"waitlistCapacity":99,
"minimumEnrollment":0,
"enrollmentTotal":0,
"waitlistTotal":0,
"academicOrg":"D2505",
"location":"CS-COLMBUS",
"equivalentCourseId":null,
"startDate":"2021-01-11",
"endDate":"2021-04-23",
"cancelDate":null,
"primaryInstructorSection":"0010",
"combinedSection":null,
"holidaySchedule":"OSUSIS",
"sessionCode":"1",
"sessionDescription":"Regular Academic Term",
"term":"Spring 2021",
"campus":"Columbus",
"attributes":[
{
"name":"CCP",
"value":"NON-CCP",
"description":"Not eligible for College Credit Plus program"
}
],
"secCampus":"COL",
"secAcademicGroup":"PBH",
"secCatalogNumber":"4998",
"meetingDays":"",
"_parent":"160532-1-1212",
"subjectDesc":"Public Health",
"courseTitle":"Undergraduate Research in Public Health",
"courseDescription":"Undergraduate research under the guidance of a faculty mentor in a basic or applied area of public health.\nPrereq: Jr or Sr standing, and enrollment in BSPH major, and permission of advisor. Students who are not junior or senior standing may be eligible with faculty mentor approval. Repeatable to a maximum of 6 cr hrs. This course is graded S/U.",
"catalogLevel":"4xxx",
"termCode":"1212"
}
]
},
{
"course":{
"term":"Spring 2021",
"effectiveDate":"2013-05-05",
"effectiveStatus":"A",
"title":"Individual Studies in Public Health",
"shortDescription":"Ind Study Pub Hlth" ```
But when I use this code to scrape the pages, it just repeats.
import requests

session = requests.Session()

def get_classes():
    url = "https://content.osu.edu/v2/classes/search?q=&campus=col&academic-career=ugrd"
    first_page = session.get(url).json()
    yield first_page
    num_pages = first_page['data']['totalPages']
    for page in range(0, num_pages + 1):
        next_page = session.get(url, params={'page': page}).json()
        yield next_page

for page in get_classes():
    data = page['data']['courses']
    array_length = len(data)
    for i in range(array_length):
        if (i <= array_length):
            course_key = data[i]['course']
            subject = course_key['subject']
            number = course_key['catalogNumber']
            title = course_key['title']
            units = course_key['minUnits']
            component = course_key['component']
            attributes = course_key['courseAttributes']
            description = course_key['description']
        else:
            break
I want to scrape all the data from each page and then proceed to the next page until I have scraped all the pages. Instead, it just prints the same page over and over again.
You can see the next page link in the response:
"nextPageLink":"?q=&campus=col&academic-career=ugrd&p=2",
So you should use p instead of page.
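Here is a minimal sketch of the corrected pagination, assuming the endpoint reads p the way nextPageLink and refineQueryTemplate suggest:
import requests

session = requests.Session()

def get_classes():
    url = "https://content.osu.edu/v2/classes/search"
    base_params = {"q": "", "campus": "col", "academic-career": "ugrd"}
    first_page = session.get(url, params=base_params).json()
    yield first_page
    num_pages = first_page['data']['totalPages']
    # "p" is the page parameter the API actually uses (see nextPageLink)
    for p in range(2, num_pages + 1):
        yield session.get(url, params={**base_params, "p": p}).json()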
How can I return or select only the parameters that are needed, in Python dict format, rather than all of the parameters that are being returned?
Here is the URL we use:
https://api.nytimes.com/svc/search/v2/articlesearch.json?begin_date=20201020&facet=false&sort=newest&api-key=[YOUR_API_KEY]
Here is the response we get:
{
"status": "OK",
"copyright": "Copyright (c) 2020 The New York Times Company. All Rights Reserved.",
"response": {
"docs": [
{
"abstract": "Our latest survey shows a shift toward Biden among college-educated white voters, but surprising Trump gains among nonwhite voters.",
"web_url": "https://www.nytimes.com/2020/10/20/upshot/poll-georgia-biden-trump.html",
"snippet": "Our latest survey shows a shift toward Biden among college-educated white voters, but surprising Trump gains among nonwhite voters.",
"lead_paragraph": "A shift against President Trump among white college-educated voters in Georgia has imperiled Republicans up and down the ballot, according to a New York Times/Siena College survey on Tuesday, as Republicans find themselves deadlocked or trailing in Senate races where their party was once considered the heavy favorite.",
"source": "The New York Times",
"multimedia": [
{
"rank": 0,
"subtype": "xlarge",
"caption": null,
"credit": null,
"type": "image",
"url": "images/2020/10/20/us/undefined-promo-1603200878027/undefined-promo-1603200878027-articleLarge.jpg",
"height": 399,
"width": 600,
"legacy": {
"xlarge": "images/2020/10/20/us/undefined-promo-1603200878027/undefined-promo-1603200878027-articleLarge.jpg",
"xlargewidth": 600,
"xlargeheight": 399
},
"subType": "xlarge",
"crop_name": "articleLarge"
},
..........
How can I return only, for example, the web_url and source parameters in Python?
Please help!
This is the code I use, but it returns all parameters:
import requests
import os
from pprint import pprint
apikey = os.getenv('VGSDRL9bWiWy70GdCPA4QX8flAsemVGJ', '...')
query_url = "https://api.nytimes.com/svc/search/v2/articlesearch.json?q=trump&sort=newest&api-key=VGSDRL9bWiWy70GdCPA4QX8flAsemVGJ"
r = requests.get(query_url)
pprint(r.json())
r = requests.get(query_url)
filtered = [{'web_url': d['web_url'], 'source': d['source']}
            for d in r.json()['response']['docs']]
pprint(filtered)
I have a response that I receive from Lobbyview in the form of JSON. I tried to put it into a data frame to access only some variables, but with no success. How can I access only some variables, such as the id and the committees, in a format exportable to .dta? Here is the code I have tried.
import requests, json

query = {"naics": "424430"}
results = requests.post('https://www.lobbyview.org/public/api/reports',
                        data=json.dumps(query))
print(results.json())

import pandas as pd
b = pd.DataFrame(results.json())
_id = data["_id"]
committee = data["_source"]["specific_issues"][0]["bills_by_algo"][0]["committees"]
An observation of the json looks like this:
"_score": 4.421936,
"_type": "object",
"_id": "5EZUMbQp3hGKH8Uq2Vxuke",
"_source":
{
"issue_codes": ["CPT"],
"received": 1214320148,
"client_name": "INTELLECTUAL PROPERTY OWNERS ASSOCIATION",
"amount": 240000,
"client":
{
"legal_name": "INTELLECTUAL PROPERTY OWNERS ASSOCIATION",
"name": "INTELLECTUAL PROPERTY OWNERS ASSOCIATION",
"naics": null,
"gvkey": null,
"ticker": "Unlisted",
"id": null,
"bvdid": "US131283992L"},
"specific_issues": [
{
"text": "H.R. 34, H.R. 1908, H.R. 2336, H.R. 3093 S. 522, S. 681, S. 1145, S. 1745",
"bills_by_algo": [
{
"titles": ["To amend title 35, United States Code, to provide for patent reform.", "Patent Reform Act of 2007", "Patent Reform Act of 2007", "Patent Reform Act of 2007"],
"top_terms": ["Commerce", "Administrative fees"],
"sponsor":
{
"firstname": "Howard",
"district": 28,
"title": "rep",
"id": 400025
},
"committees": ["House Judiciary"],
"introduced": 1176868800,
"type": "HR", "id": "110_HR1908"},
{
"titles": ["To amend title 35, United States Code, relating to the funding of the United States Patent and Trademark Office."],
"top_terms": ["Commerce", "Administrative fees"],
"sponsor":
{
"firstname": "Howard",
"district": 28,
"title": "rep",
"id": 400025
},
"committees": ["House Judiciary"],
"introduced": 1179288000,
"type": "HR",
"id": "110_HR2336"
}],
"gov_entities": ["U.S. House of Representatives", "Patent and Trademark Office (USPTO)", "U.S. Senate", "UNDETERMINED", "U.S. Trade Representative (USTR)"],
"lobbyists": ["Valente, Thomas Silvio", "Wamsley, Herbert C"],
"year": 2007,
"issue": "CPT",
"id": "S4nijtRn9Q5NACAmbqFjvZ"}],
"year": 2007,
"is_latest_amendment": true,
"type": "MID-YEAR AMENDMENT",
"id": "1466CDCD-BA3D-41CE-B7A1-F9566573611A",
"alternate_name": "INTELLECTUAL PROPERTY OWNERS ASSOCIATION"
},
"_index": "collapsed"}```
Since the data you specified is nested pretty deeply in the JSON response, you have to loop through it and save it to a list temporarily. To understand the response data better, I would advise you to use some tool to inspect the JSON structure, like an online JSON viewer. Not every entry in the JSON contains the necessary data, so I try to catch the error with try/except. To make sure that the id and committees are matched correctly, I chose to add them as small dicts to the list. This list can then be read into pandas with ease. Saving to .dta requires converting the lists inside the committees column to strings; alternatively, you might want to save as .csv for a more generally usable format.
import requests, json
import pandas as pd

query = {"naics": "424430"}
results = requests.post(
    "https://www.lobbyview.org/public/api/reports", data=json.dumps(query)
)
json_response = results.json()["result"]

# to save the JSON response
# with open("data.json", "w") as outfile:
#     json.dump(results.json()["result"], outfile)

resulting_data = []

# loop through the response
for data in json_response:
    # try to find entries with specific_issues, bills_by_algo and committees
    try:
        # loop through the specific issues
        for special_issue in data["specific_issues"]:
            _id = special_issue["id"]
            # loop through the bills_by_algo's
            for x in special_issue["bills_by_algo"]:
                # append the id and committees as a dict
                resulting_data.append({"id": _id, "committees": x["committees"]})
    except KeyError as e:
        print(e, "not found in entry.")
        continue

# create a DataFrame
df = pd.DataFrame(resulting_data)

# export of list objects in the column is not supported by .dta, therefore we
# convert to strings with ";" as the delimiter
df["committees"] = ["; ".join(map(str, l)) for l in df["committees"]]

print(df)
df.to_stata("result.dta")
Results in
id committees
0 D8BxG5664FFb8AVc6KTphJ House Judiciary
1 D8BxG5664FFb8AVc6KTphJ Senate Judiciary
2 8XQE5wu3mU7qvVPDpUWaGP House Agriculture
3 8XQE5wu3mU7qvVPDpUWaGP Senate Agriculture, Nutrition, and Forestry
4 kzZRLAHdMK4YCUQtQAdCPY House Agriculture
.. ... ...
406 ZxXooeLGVAKec9W2i32hL5 House Agriculture
407 ZxXooeLGVAKec9W2i32hL5 Senate Agriculture, Nutrition, and Forestry; H...
408 ZxXooeLGVAKec9W2i32hL5 House Appropriations; Senate Appropriations
409 ahmmafKLfRP8wZay9o8GRf House Agriculture
410 ahmmafKLfRP8wZay9o8GRf Senate Agriculture, Nutrition, and Forestry
[411 rows x 2 columns]
I just want to grab public information from my accounts on two social media sites (Instagram and Twitter). My code returns info for Twitter, and I know the XPath is correct for Instagram, but for some reason I'm not getting data for it. I know the XPaths could be more specific, but I can fix that later. Both my accounts are public.
1) I thought maybe it didn't like the Python header, so I tried changing it and I still get nothing. That line is commented out but it's still there.
2) I heard something about an API on GitHub; that lengthy code is very intimidating and way above my level of understanding. I don't know more than half of what I'm reading on there.
from lxml import html
import requests
import webbrowser
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
#page = requests.get('https://www.instagram.com/<my account>/', headers=headers)
page = requests.get('https://www.instagram.com/<my account>/')
tree = html.fromstring(page.text)
pageTwo = requests.get('http://www.twitter.com/<my account>')
treeTwo = html.fromstring(pageTwo.text)
instaFollowers = tree.xpath("//span[@data-reactid='.0.1.0.0:0.1.3.1.0']/span[2]/text()")
instaFollowing = tree.xpath("//span[@data-reactid='.0.1.0.0:0.1.3.2.0']/span[2]/text()")
twitFollowers = treeTwo.xpath("//a[@data-nav='followers']/span[@class='ProfileNav-value']/text()")
twitFollowing = treeTwo.xpath("//a[@data-nav='following']/span[@class='ProfileNav-value']/text()")
print ''
print '--------------------'
print 'Social Media Checker'
print '--------------------'
print ''
print 'Instagram: ' + str(instaFollowers) + ' / ' + str(instaFollowing)
print ''
print 'Twitter: ' + str(twitFollowers) + ' / ' + str(twitFollowing)
As mentioned, Instagram's page source does not reflect its rendered content, since a JavaScript function passes content from JSON data to the browser. Hence, what Python scrapes from the page source is not exactly what the browser renders to the screen. Welcome to the new world of dynamic web programming! Consider using Instagram's API or another parser that can retrieve the HTML-generated content (not just the page source).
With that said, if you simply need the IG account data, you can still use Python's lxml to XPath the JSON content inside a <script> tag (specifically the sixth occurrence, but adjust for the page you need). The example below parses Google's Instagram JSON data:
import lxml.etree as et
import urllib.request as rq
rqpage = rq.urlopen('https://instagram.com/google')
txtpage = rqpage.read()
tree = et.HTML(txtpage)
jsondata = tree.xpath("//script[@type='text/javascript' and position()=6]/text()")
for i in jsondata:
print(i)
OUTPUT
window._sharedData = {"qs":"{\"shift\":10,\"header
\":\"n3bTdmHGHDgxvZYPN0KDFHqbkxd6zpTl\",\"edges\":100,\"blob
\":\"AQCq42rOTCnKOZcOxFn06L1J6_W8wY6ntAS1bX88VBClAjQD9PyJdefCzOwfSAbUdsBwHKb1QSndurPtjyN-
rHMOrZ_6ubE_Xpu908cyron9Zczkj4QMkAYUHIgnmmftuXG8rrFzq_Oq3BoXpQgovI9hefha-
6SAs1RLJMwMArrbMlFMLAwyd1TZhArcxQkk9bgRGT4MZK4Tk2VNt1YOKDN1pO3NJneFlUxdUJTdDX
zj3eY-stT7DnxF_GM_j6xwk1o\",\"iterations\":7,\"size\":42}","static_root":"
\/\/instagramstatic-a.akamaihd.net\/bluebar\/5829dff","entry_data":
{"ProfilePage":[{"__query_string":"?","__path":"\/google\/","__get_params":
{},"user":{"username":"google","has_blocked_viewer":false,"follows":
{"count":10},"requested_by_viewer":false,"followed_by":
{"count":977186},"country_block":null,"has_requested_viewer":false,"followed_
by_viewer":false,"follows_viewer":false,"profile_pic_url":"https:
\/\/instagram.ford1-1.fna.fbcdn.net\/hphotos-xfp1\/t51.2885-19\/s150x150
\/11910217_933356470069152_115044571_a.jpg","is_private":false,"full_name":
"Google","media":{"count":180,"page_info":
{"has_previous_page":false,"start_cursor":"1126896719808871555","end_cursor":
"1092117490206686720","has_next_page":true},"nodes":[{"code":"-
jipiawryD","dimensions":{"width":640,"height":640},"owner":
{"id":"1067259270"},"comments":{"count":105},"caption":"Today's the day!
Your searches are served. Happy Thanksgiving \ud83c\udf57\ud83c\udf70
#GoogleTrends","likes":
{"count":11410},"date":1448556579.0,"thumbnail_src":"https:\/
\/instagram.ford1-1.fna.fbcdn.net\/hphotos-xat1\/t51.2885-15\/e15\
/11848856_482502108621097_589421586_n.jpg","is_video":true,"id":"112689671980
8871555","display_src":"https:\/\/instagram.ford1-1.fna.fbcdn.net\/hphotos-
xat1\/t51.2885-15
...
JSON Pretty Print (extracting the window._sharedData variable from above)
See below where user (followers, following, etc.) data shows at beginning:
{
"qs": "{\"shift\":10,\"header\":\"n3bTdmHGHDgxvZYPN0KDFHqbkxd6zpTl\",\"edges\":100,\"blob\":\"AQCq42rOTCnKOZcOxFn06L1J6_W8wY6ntAS1bX88VBClAjQD9PyJdefCzOwfSAbUdsBwHKb1QSndurPtjyN-rHMOrZ_6ubE_Xpu908cyron9Zczkj4QMkAYUHIgnmmftuXG8rrFzq_Oq3BoXpQgovI9hefha-6SAs1RLJMwMArrbMlFMLAwyd1TZhArcxQkk9bgRGT4MZK4Tk2VNt1YOKDN1pO3NJneFlUxdUJTdDXzj3eY-stT7DnxF_GM_j6xwk1o\",\"iterations\":7,\"size\":42}",
"static_root": "\/\/instagramstatic-a.akamaihd.net\/bluebar\/5829dff",
"entry_data": {
"ProfilePage": [
{
"__query_string": "?",
"__path": "\/google\/",
"__get_params": {
},
"user": {
"username": "google",
"has_blocked_viewer": false,
"follows": {
"count": 10
},
"requested_by_viewer": false,
"followed_by": {
"count": 977186
},
"country_block": null,
"has_requested_viewer": false,
"followed_by_viewer": false,
"follows_viewer": false,
"profile_pic_url": "https:\/\/instagram.ford1-1.fna.fbcdn.net\/hphotos-xfp1\/t51.2885-19\/s150x150\/11910217_933356470069152_115044571_a.jpg",
"is_private": false,
"full_name": "Google",
"media": {
"count": 180,
"page_info": {
"has_previous_page": false,
"start_cursor": "1126896719808871555",
"end_cursor": "1092117490206686720",
"has_next_page": true
},
"nodes": [
{
"code": "-jipiawryD",
"dimensions": {
"width": 640,
"height": 640
},
"owner": {
"id": "1067259270"
},
"comments": {
"count": 105
},
"caption": "Today's the day! Your searches are served. Happy Thanksgiving \ud83c\udf57\ud83c\udf70 #GoogleTrends",
"likes": {
"count": 11410
},
"date": 1448556579,
"thumbnail_src": "https:\/\/instagram.ford1-1.fna.fbcdn.net\/hphotos-xat1\/t51.2885-15\/e15\/11848856_482502108621097_589421586_n.jpg",
"is_video": true,
"id": "1126896719808871555",
"display_src": "https:\/\/instagram.ford1-1.fna.fbcdn.net\/hphotos-xat1\/t51.2885-15\/e15\/11848856_482502108621097_589421586_n.jpg"
},
{
"code": "-hwbf2wr0O",
"dimensions": {
"width": 640,
"height": 640
},
"owner": {
"id": "1067259270"
},
"comments": {
"count": 95
},
"caption": "Thanksgiving dinner is waiting. But first, the airport. \u2708\ufe0f #GoogleApp",
"likes": {
"count": 12621
},
...
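If the follower/following counts are all you need, you could go one step further and decode that JSON directly. Here is a sketch based on the structure shown above; the window._sharedData prefix and the position of the <script> tag may have changed since this output was captured, so treat the keys as assumptions to verify against your own page:
import json
import lxml.etree as et
import urllib.request as rq

rqpage = rq.urlopen('https://instagram.com/google')
tree = et.HTML(rqpage.read())

# same <script> tag as above; strip the JS assignment and the trailing ";"
raw = tree.xpath("//script[@type='text/javascript' and position()=6]/text()")[0]
shared_data = json.loads(raw.replace("window._sharedData = ", "").rstrip(";\n"))

# keys taken from the pretty-printed JSON above
user = shared_data["entry_data"]["ProfilePage"][0]["user"]
print("followers:", user["followed_by"]["count"])
print("following:", user["follows"]["count"])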
If anyone is still interested in this sort of thing, using Selenium solved my problems.
http://pastebin.com/5eHeDt3r
Is there a faster way?
In case you want to find information about yourself and others without hassling with code, try this piece of software. Apart from automatic scraping, it analyzes and visualizes the collected information in a PDF report from social networks such as Facebook, Twitter, Instagram, and the Google Search engine.
P.S. I am the main developer and maintainer of this project.