Python scrapy yield to .json file not working

Python scrapy yield to .json file not working - python

I want to use Scrapy to extract the titles of different books in a url and output/store them as an array of dictionaries in a json file.
Here is my code:
import scrapy
class BooksSpider(scrapy.Spider):
name = "books"
star_urls = [
"http://books.toscrape.com"
]
def parse(self, response):
titles = response.css("article.product_pod h3 a::attr(title)").getall()
for title in titles:
yield {"title": title}
Here is what i put in the terminal:
scrapy crawl books -o books.json
The books.json file is created but is empty.
I checked that I was in the right directory and venv but it still doesn't work.
However:
Earlier, i deployed this spider to scrape the whole html data and write it to a books.html file and everything worked.
Here is my code for this:
import scrapy
class BooksSpider(scrapy.Spider):
name = "books"
star_urls = [
"http://books.toscrape.com"
]
def parse(self, response):
with open("books.html", "wb") as file:
file.write(response.body)
and here is what I put in my terminal:
scrapy crawl books
Any ideas on what I'm doing wrong? Thanks
Edit:
inputing response.css('article.product_pod h3 a::attr(title)').getall()
into the scrapy shell outputs:
['A Light in the Attic', 'Tipping the Velvet', 'Soumission', 'Sharp Objects', 'Sapiens: A Brief History of Humankind', 'The Requiem Red', 'The Dirty Little Secrets of Getting Your Dream Job', 'The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull', 'The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics', 'The Black Maria', 'Starving Hearts (Triangular Trade Trilogy, #1)', "Shakespeare's Sonnets", 'Set Me Free', "Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)", 'Rip it Up and Start Again', 'Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991', 'Olio', 'Mesaerion: The Best Science Fiction Stories 1800-1849', 'Libertarianism for Beginners', "It's Only the Himalayas"]

Now run the code.It should work
import scrapy
class QuotesSpider(scrapy.Spider):
name = "quotes"
start_urls = ['http://books.toscrape.com/']
def parse(self, response):
titles = response.css('.product_pod')
for title in titles:
yield {
"title": title.css('h3 a::attr(title)').get()
#"title": title.css('h3 a::text').get()
}
Output:
[
{
"title": "A Light in the Attic"
},
{
"title": "Tipping the Velvet"
},
{
"title": "Soumission"
},
{
"title": "Sharp Objects"
},
{
"title": "Sapiens: A Brief History of Humankind"
},
{
"title": "The Requiem Red"
},
{
"title": "The Dirty Little Secrets of Getting Your Dream Job"
},
{
"title": "The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull"
},
{
"title": "The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics"
},
{
"title": "The Black Maria"
},
{
"title": "Starving Hearts (Triangular Trade Trilogy, #1)"
},
{
"title": "Shakespeare's Sonnets"
},
{
"title": "Set Me Free"
},
{
"title": "Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)"
},
{
"title": "Rip it Up and Start Again"
},
{
"title": "Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991"
},
{
"title": "Olio"
},
{
"title": "Mesaerion: The Best Science Fiction Stories 1800-1849"
},
{
"title": "Libertarianism for Beginners"
},
{
"title": "It's Only the Himalayas"
}
]

Related

How to write a JSONL file in Python

I am trying to write a .jsonl file that needs to look like this:
{"file_name": "0001.png", "text": "This is a golden retriever playing with a ball"}
{"file_name": "0002.png", "text": "A german shepherd"}
{"file_name": "0003.png", "text": "One chihuahua"}
This is my attempt:
import json
import pandas as pd
dt = pd.read_csv('data.csv')
df = pd.DataFrame(dt)
file_name = df['image']
file_caption = df['text']
data = []
for i in range(len(file_name)):
entry = {"file_name": file_name[i], "text": file_caption[i]}
data.append(entry)
json_object = json.dumps(data, indent=4)
# Writing to sample.json
with open("metadata.jsonl", "w") as outfile:
outfile.write(json_object)
But this is the output I get:
[
{
"file_name": "images/image_0.jpg",
"text": "Fattoush Salad with Roasted Potatoes"
},
{
"file_name": "images/image_1.jpg",
"text": "an analysis of self portrayal in novels by virginia woolf A room of one's own study guide contains a biography of virginia woolf, literature essays, quiz questions, major themes, characters, and a full summary and analysis about a room of one's own a room of one's own summary."
},
{
"file_name": "images/image_2.jpg",
"text": "Christmas Comes Early to U.K. Weekly Home Entertainment Chart"
},
{
"file_name": "images/image_3.jpg",
"text": "Amy Garcia Wikipedia a legacy of reform: dorothea dix (1802\u20131887) | states of"
},
{
"file_name": "images/image_4.jpg",
"text": "3D Metal Cornish Harbour Painting"
},
{
"file_name": "images/image_5.jpg",
"text": "\"In this undated photo provided by the New York City Ballet, Robert Fairchild performs in \"\"In Creases\"\" by choreographer Justin Peck which is being performed by the New York City Ballet in New York. (AP Photo/New York City Ballet, Paul Kolnik)\""
},
...
]
I know that its because I am dumping a list so I know where I'm going wrong but how do I create a .jsonl file like the format above?

Don't indent the generated JSON and don't append it to a list. Just write out each line to the file:
import json
import pandas as pd
df = pd.DataFrame([['0001.png', "This is a golden retriever playing with a ball"],
['0002.png', "A german shepherd"],
['0003.png', "One chihuahua"]], columns=['filename','text'])
with open("metadata.jsonl", "w") as outfile:
for file, caption in zip(df['filename'], df['text']):
entry = {"file_name": file, "text": caption}
print(json.dumps(entry), file=outfile)
Output:
{"file_name": "0001.png", "text": "This is a golden retriever playing with a ball"}
{"file_name": "0002.png", "text": "A german shepherd"}
{"file_name": "0003.png", "text": "One chihuahua"}

Can't stratify output based on different headings and their corresponding paragraphs

I'm trying to fetch each heading and their corresponding paragraphs from the html elements below. The results should be stored within a dictionary. Whatever I've tried so far produces ludicrously haphazard output. I intentionally did not paste the current output only because of brevity of space.
html = """
<h1>a Complexity Profile</h1>
<p>Since time immemorial humans have...</p>
<p>How often have we been told</p>
<h2>INDIVIDUAL AND COLLECTIVE BEHAVIOR</h2>
<p>Building a model of society based...</p>
<p>All macroscopic systems...</p>
<h3>COMPLEXITY PROFILE</h3>
<p>It is much easier to think about the...</p>
<p>A formal definition of scale considers...</p>
<p>The complexity profile counts...</p>
<h2>CONTROL IN HUMAN ORGANIZATIONS</h2>
<p>Using this argument it is straightforward...</p>
<h2>Conclusion</h2>
<p>There are two natural conclusions...</p>
"""
I've tried with (producing messy output):
import json
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
data = []
for item in soup.select("h1,h2,h3,h4,h5,h6"):
d = {}
d['title'] = item.text
d['text'] = [i.text for i in item.find_next_siblings('p')]
data.append(d)
print(json.dumps(data,indent=4))
Output I wish to get:
[
{
"title": "a Complexity Profile",
"text": [
"Since time immemorial humans have...",
"How often have we been told",
]
},
{
"title": "INDIVIDUAL AND COLLECTIVE BEHAVIOR",
"text": [
"Building a model of society based...",
"All macroscopic systems...",
]
},
{
"title": "COMPLEXITY PROFILE",
"text": [
"It is much easier to think about the...",
"A formal definition of scale considers...",
"The complexity profile counts...",
]
},
{
"title": "CONTROL IN HUMAN ORGANIZATIONS",
"text": [
"Using this argument it is straightforward...",
]
},
{
"title": "Conclusion",
"text": [
"There are two natural conclusions..."
]
}
]

You can use find_previous() to check if you're in right "section":
import re
import json
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "lxml")
r = re.compile(r"h\d+", flags=re.I)
data = []
for h in soup.find_all(name=r):
data.append({"title": h.get_text(strip=True), "text": []})
p = h.find_next_sibling("p")
while p and p.find_previous(name=r) == h:
data[-1]["text"].append(p.get_text())
p = p.find_next_sibling("p")
print(json.dumps(data, indent=4))
Prints:
[
{
"title": "a Complexity Profile",
"text": [
"Since time immemorial humans have...",
"How often have we been told"
]
},
{
"title": "INDIVIDUAL AND COLLECTIVE BEHAVIOR",
"text": [
"Building a model of society based...",
"All macroscopic systems..."
]
},
{
"title": "COMPLEXITY PROFILE",
"text": [
"It is much easier to think about the...",
"A formal definition of scale considers...",
"The complexity profile counts..."
]
},
{
"title": "CONTROL IN HUMAN ORGANIZATIONS",
"text": [
"Using this argument it is straightforward..."
]
},
{
"title": "Conclusion",
"text": [
"There are two natural conclusions..."
]
}
]
EDIT: Good idea is using itertools.takewhile:
import re
import json
from bs4 import BeautifulSoup
from itertools import takewhile
soup = BeautifulSoup(html, "lxml")
r = re.compile(r"h\d+", flags=re.I)
data = []
for h in soup.find_all(name=r):
data.append({"title": h.get_text(strip=True), "text": []})
for p in takewhile(lambda p: p.find_previous(name=r) == h, h.find_next_siblings("p")):
data[-1]["text"].append(p.get_text())
print(json.dumps(data, indent=4))
Using just list comprehension:
data = [
{
"title": h.get_text(strip=True),
"text": [
p.get_text()
for p in takewhile(
lambda p: p.find_previous(name=r) == h,
h.find_next_siblings("p"),
)
],
}
for h in soup.find_all(name=r)
]

Tricky problem. I think you have to handle things linearly:
import json
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
data = []
pending = {}
for item in soup.select("h1,h2,h3,h4,h5,h6,p"):
if item.name == 'p':
pending['text'].append( item.text )
else:
if pending:
data.append(pending)
pending = {'title': item.text, 'text': [] }
data.append( pending )
print(json.dumps(data,indent=4))
Output:
timr#tims-gram:~/src$ python x.py
[
{
"title": "a Complexity Profile",
"text": [
"Since time immemorial humans have...",
"How often have we been told"
]
},
{
"title": "INDIVIDUAL AND COLLECTIVE BEHAVIOR",
"text": [
"Building a model of society based...",
"All macroscopic systems..."
]
},
{
"title": "COMPLEXITY PROFILE",
"text": [
"It is much easier to think about the...",
"A formal definition of scale considers...",
"The complexity profile counts..."
]
},
{
"title": "CONTROL IN HUMAN ORGANIZATIONS",
"text": [
"Using this argument it is straightforward..."
]
},
{
"title": "Conclusion",
"text": [
"There are two natural conclusions..."
]
}
]
timr#tims-gram:~/src$

for loop keeps repeating when trying to scrape API to next page

I'm scraping data from the following API: https://content.osu.edu/v2/classes/search?q=&campus=col&academic-career=ugrd
The JSON format looks like:
{
"data":{
"totalItems":10000,
"currentItemCount":200,
"page":1,
"totalPages":50,
"refineQueryTemplate":"q=_QUERY_&campus=col&academic-career=ugrd&p=1",
"nextPageLink":"?q=&campus=col&academic-career=ugrd&p=2",
"prevPageLink":null,
"activeSort":"",
"courses":[
{
"course":{
"term":"Summer 2021",
"effectiveDate":"2019-01-06",
"effectiveStatus":"A",
"title":"Dental Hygiene Practicum",
"shortDescription":"DHY Practicum",
"description":"Supervised practice outside the traditional clinic in a setting similar to one in which the dental hygiene student may practice, teach, or conduct research upon graduation.\nPrereq: Sr standing in DHY or BDCP major. Repeatable to a maximum of 4 cr hrs or 4 completions. This course is graded S/U.",
"equivalentId":"S1989",
"allowMultiEnroll":"N",
"maxUnits":4,
"minUnits":1,
"repeatUnitsLimit":4,
"grading":"Satisfactory/Unsatisfactory",
"component":"Field Experience",
"primaryComponent":"FLD",
"offeringNumber":"1",
"academicGroup":"Dentistry",
"subject":"DENTHYG",
"catalogNumber":"4430",
"campus":"Columbus",
"academicOrg":"D2120",
"academicCareer":"Undergraduate",
"cipCode":"51.0602",
"courseAttributes":[
{
"name":"CCP",
"value":"NON-CCP",
"description":"Not eligible for College Credit Plus program"
}
],
"campusCode":"COL",
"catalogLevel":"4xxx",
"subjectDesc":"Dental Hygiene",
"courseId":"152909"
},
"sections":[
{
"classNumber":"20850",
"section":"10",
"component":"Field Experience",
"instructionMode":"Distance Learning",
"meetings":[
{
"meetingNumber":1,
"facilityId":null,
"facilityType":null,
"facilityDescription":null,
"facilityDescriptionShort":null,
"facilityGroup":null,
"facilityCapacity":0,
"buildingCode":null,
"room":null,
"buildingDescription":null,
"buildingDescriptionShort":null,
"startTime":null,
"endTime":null,
"startDate":"2021-05-12",
"endDate":"2021-07-30",
"monday":false,
"tuesday":false,
"wednesday":false,
"thursday":false,
"friday":false,
"saturday":false,
"sunday":false,
"standingMeetingPattern":null,
"instructors":[
{
"displayName":"Irina A Novopoltseva",
"role":"PI",
"email":"novopoltseva.1#osu.edu"
}
]
}
],
"courseOfferingNumber":1,
"courseId":"152909",
"academicGroup":"DEN",
"subject":"Dental Hygiene",
"catalogNumber":"4430",
"career":"UGRD",
"description":"DHY Practicum",
"enrollmentStatus":"Open",
"status":"A",
"type":"E",
"associatedClass":"10",
"autoEnrollWaitlist":true,
"autoEnrollSection1":null,
"autoEnrollSection2":null,
"consent":"D",
"waitlistCapacity":5,
"minimumEnrollment":0,
"enrollmentTotal":1,
"waitlistTotal":0,
"academicOrg":"D2120",
"location":"CS-COLMBUS",
"equivalentCourseId":null,
"startDate":"2021-05-12",
"endDate":"2021-07-30",
"cancelDate":null,
"primaryInstructorSection":"10",
"combinedSection":null,
"holidaySchedule":"OSUSIS",
"sessionCode":"1S",
"sessionDescription":"Summer Term",
"term":"Summer 2021",
"campus":"Columbus",
"attributes":[
{
"name":"CCP",
"value":"NON-CCP",
"description":"Not eligible for College Credit Plus program"
}
],
"secCampus":"COL",
"secAcademicGroup":"DEN",
"secCatalogNumber":"4430",
"meetingDays":"",
"_parent":"152909-1-1214",
"subjectDesc":"Dental Hygiene",
"courseTitle":"Dental Hygiene Practicum",
"courseDescription":"Supervised practice outside the traditional clinic in a setting similar to one in which the dental hygiene student may practice, teach, or conduct research upon graduation.\nPrereq: Sr standing in DHY or BDCP major. Repeatable to a maximum of 4 cr hrs or 4 completions. This course is graded S/U.",
"catalogLevel":"4xxx",
"termCode":"1214"
}
]
},
{
"course":{
"term":"Spring 2021",
"effectiveDate":"2020-08-24",
"effectiveStatus":"A",
"title":"Undergraduate Research in Public Health",
"shortDescription":"Res Pub Hlth",
"description":"Undergraduate research under the guidance of a faculty mentor in a basic or applied area of public health.\nPrereq: Jr or Sr standing, and enrollment in BSPH major, and permission of advisor. Students who are not junior or senior standing may be eligible with faculty mentor approval. Repeatable to a maximum of 6 cr hrs. This course is graded S/U.",
"equivalentId":"",
"allowMultiEnroll":"N",
"maxUnits":6,
"minUnits":1,
"repeatUnitsLimit":6,
"grading":"Satisfactory/Unsatisfactory",
"component":"Independent Study",
"primaryComponent":"IND",
"offeringNumber":"1",
"subject":"PUBHLTH",
"catalogNumber":"4998",
"campus":"Columbus",
"academicOrg":"D2505",
"academicCareer":"Undergraduate",
"cipCode":"51.2201",
"courseAttributes":[
{
"name":"CCP",
"value":"NON-CCP",
"description":"Not eligible for College Credit Plus program"
}
],
"campusCode":"COL",
"catalogLevel":"4xxx",
"subjectDesc":"Public Health",
"courseId":"160532"
},
"sections":[
{
"classNumber":"3557",
"section":"0030",
"component":"Independent Study",
"instructionMode":"In Person",
"meetings":[
{
"meetingNumber":1,
"facilityId":null,
"facilityType":null,
"facilityDescription":null,
"facilityDescriptionShort":null,
"facilityGroup":null,
"facilityCapacity":0,
"buildingCode":null,
"room":null,
"buildingDescription":null,
"buildingDescriptionShort":null,
"startTime":null,
"endTime":null,
"startDate":"2021-01-11",
"endDate":"2021-04-23",
"monday":false,
"tuesday":false,
"wednesday":false,
"thursday":false,
"friday":false,
"saturday":false,
"sunday":false,
"standingMeetingPattern":null,
"instructors":[
{
"displayName":"Abigail Norris Turner",
"role":"PI",
"email":"norris-turner.1#osu.edu"
}
]
}
],
"courseOfferingNumber":1,
"courseId":"160532",
"academicGroup":"PBH",
"subject":"Public Health",
"catalogNumber":"4998",
"career":"UGRD",
"description":"Res Pub Hlth",
"enrollmentStatus":"Open",
"status":"A",
"type":"E",
"associatedClass":"1",
"autoEnrollWaitlist":true,
"autoEnrollSection1":null,
"autoEnrollSection2":null,
"consent":"I",
"waitlistCapacity":99,
"minimumEnrollment":0,
"enrollmentTotal":0,
"waitlistTotal":0,
"academicOrg":"D2505",
"location":"CS-COLMBUS",
"equivalentCourseId":null,
"startDate":"2021-01-11",
"endDate":"2021-04-23",
"cancelDate":null,
"primaryInstructorSection":"0010",
"combinedSection":null,
"holidaySchedule":"OSUSIS",
"sessionCode":"1",
"sessionDescription":"Regular Academic Term",
"term":"Spring 2021",
"campus":"Columbus",
"attributes":[
{
"name":"CCP",
"value":"NON-CCP",
"description":"Not eligible for College Credit Plus program"
}
],
"secCampus":"COL",
"secAcademicGroup":"PBH",
"secCatalogNumber":"4998",
"meetingDays":"",
"_parent":"160532-1-1212",
"subjectDesc":"Public Health",
"courseTitle":"Undergraduate Research in Public Health",
"courseDescription":"Undergraduate research under the guidance of a faculty mentor in a basic or applied area of public health.\nPrereq: Jr or Sr standing, and enrollment in BSPH major, and permission of advisor. Students who are not junior or senior standing may be eligible with faculty mentor approval. Repeatable to a maximum of 6 cr hrs. This course is graded S/U.",
"catalogLevel":"4xxx",
"termCode":"1212"
}
]
},
{
"course":{
"term":"Spring 2021",
"effectiveDate":"2013-05-05",
"effectiveStatus":"A",
"title":"Individual Studies in Public Health",
"shortDescription":"Ind Study Pub Hlth" ```
But when I use this code to scrape the pages, it just repeats.
import requests
session = requests.Session()
def get_classes():
url = "https://content.osu.edu/v2/classes/search?q=&campus=col&academic- career=ugrd"
first_page = session.get(url).json()
yield first_page
num_pages = first_page['data']['totalPages']
for page in range(0, num_pages + 1):
next_page = session.get(url, params={'page': page}).json()
yield next_page
for page in get_classes():
data = page['data']['courses']
array_length = len(data)
for i in range(array_length):
if (i <= array_length):
course_key = data[i]['course']
subject = course_key['subject']
number = course_key['catalogNumber']
title = course_key['title']
units = course_key['minUnits']
component = course_key['component']
attributes = course_key['courseAttributes']
description = course_key['description']
else:
break
I want to scrape all the data from the page and then proceed to the next page until I scraped all the pages. Instead, it just prints the same page over and over again.

You can see the next page link in the response:
"nextPageLink":"?q=&campus=col&academic-career=ugrd&p=2",
So you should use p instead of page.

How to Return Only Selected API Parameters in Python

how can i return or select only those parameters that are needed in Python dict format. Not all of the parameters that are being return.
Here is the url we use:
https://api.nytimes.com/svc/search/v2/articlesearch.json?begin_date=20201020&facet=false&sort=newest&api-key=[YOUR_API_KEY]
Here is the response we get:
{
"status": "OK",
"copyright": "Copyright (c) 2020 The New York Times Company. All Rights Reserved.",
"response": {
"docs": [
{
"abstract": "Our latest survey shows a shift toward Biden among college-educated white voters, but surprising Trump gains among nonwhite voters.",
"web_url": "https://www.nytimes.com/2020/10/20/upshot/poll-georgia-biden-trump.html",
"snippet": "Our latest survey shows a shift toward Biden among college-educated white voters, but surprising Trump gains among nonwhite voters.",
"lead_paragraph": "A shift against President Trump among white college-educated voters in Georgia has imperiled Republicans up and down the ballot, according to a New York Times/Siena College survey on Tuesday, as Republicans find themselves deadlocked or trailing in Senate races where their party was once considered the heavy favorite.",
"source": "The New York Times",
"multimedia": [
{
"rank": 0,
"subtype": "xlarge",
"caption": null,
"credit": null,
"type": "image",
"url": "images/2020/10/20/us/undefined-promo-1603200878027/undefined-promo-1603200878027-articleLarge.jpg",
"height": 399,
"width": 600,
"legacy": {
"xlarge": "images/2020/10/20/us/undefined-promo-1603200878027/undefined-promo-1603200878027-articleLarge.jpg",
"xlargewidth": 600,
"xlargeheight": 399
},
"subType": "xlarge",
"crop_name": "articleLarge"
},
..........
How can i only return for example:
web_url and source parameters in Python?
Please help !!!
this is the code i use, but it returns all parameters:
import requests
import os
from pprint import pprint
apikey = os.getenv('VGSDRL9bWiWy70GdCPA4QX8flAsemVGJ', '...')
query_url = "https://api.nytimes.com/svc/search/v2/articlesearch.json?q=trump&sort=newest&api-key=VGSDRL9bWiWy70GdCPA4QX8flAsemVGJ"
r = requests.get(query_url)
pprint(r.json())

r = requests.get(query_url)
filtered = [{'web_url': d['web_url'], 'source': d['source']} for d in
r.json()['response']['docs']]
pprint(filtered)

Get a list of elements into separated arrays

Hello fellow developer out there,
I'm new to Python & I need to write a web scraper to catch info from Scholar Google.
I ended up coding this function to get values using Xpath:
thread = browser.find_elements(By.XPATH,(" %s" % exp))
xArray = []
for t in thread:
if not atr:
xThread = t.text
else:
xThread = t.get_attribute('href')
xArray.append(xThread)
return xArray
I don't know if it's a good or a bad solution. So, I humbly accept any suggestions to make it work better.
Anyway, my actual problem is that I am getting all authors name from the page I am scraping and what I really need are the names, grouped by result.
When I ask to print the results I wish I could have something like this:
[[author1, author2,author 3],[author 4,author 5,author6]]
What am I getting right now is:
[author1,author3,author4,author5,author6]
The structure is as follows:
<div class="gs_a">
LR Hisch,
AM Gobin
,AR Lowery,
F Tam
... -Annals of biomedical ...,2006 - Springer
</div>
And the same structure is repetead all over the page for different documents and authors.
And this is the call to the function I explained earlier:
authors = (clothoSpins(".//*[#class='gs_a']//a"))
Which gets me the entire list of authors.

Here is the logic (used selenium in the below code but update it as per your need).
Logic:
url = "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C21&q=python&btnG="
driver.get(url)
# get the authors and add to list
listBooks = []
books = driver.find_elements_by_xpath("//div[#class='gs_a']")
for bookNum in books:
auths = []
authors = driver.find_elements_by_xpath("(//div[#class='gs_a'])[%s]/a|(//div[#class='gs_a'])[%s]/self::*[not(a)]"%(bookNum+1,bookNum+1))
for author in authors:
auths.append(author.text)
listBooks.append(auths)
Output:
[['F Pedregosa', 'G Varoquaux', 'A Gramfort'], ['PD Adams', 'PV Afonine'], ['TE Oliphant'], ['JW Peirce'], ['S Anders', 'PT Pyl', 'W Huber'], ['MF Sanner'], ['S Bird', 'E Klein'], ['M Lutz - 2001 - books.google.com'], ['G Rossum - 1995 - dl.acm.org'], ['W McKinney - … of the 9th Python in Science Conference, 2010 - pdfs.semanticscholar.org']]
Screenshot:

To group by result you can create an empty list, iterate over results, and append extracted data to the list as a dict, and returned result could be serialized to a JSON string using json_dumps() method e.g:
temp_list = []
for result in results:
# extracting title, link, etc.
temp_list.append({
"title": title,
# other extracted elements
})
print(json.dumps(temp_list, indent=2))
"""
Returned results is a list of dictionaries:
[
{
"title": "A new biology for a new century",
# other extracted elements..
}
]
"""
Code and full example in the online IDE:
from parsel import Selector
import requests, json, re
# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
"q": "biology", # search query
"hl": "en" # language
}
# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36",
}
html = requests.get("https://scholar.google.com/scholar", params=params, headers=headers, timeout=30)
selector = Selector(html.text)
data = []
for result in selector.css(".gs_ri"):
# xpath("normalize-space()") to get blank text nodes as well to get the full string output
title = result.css(".gs_rt a").xpath("normalize-space()").get()
# https://regex101.com/r/7bmx8h/1
authors = re.search(r"^(.*?)-", result.css(".gs_a").xpath("normalize-space()").get()).group(1).strip()
snippet = result.css(".gs_rs").xpath("normalize-space()").get()
# https://regex101.com/r/47erNR/1
year = re.search(r"\d+", result.css(".gs_a").xpath("normalize-space()").get()).group(0)
# https://regex101.com/r/13468d/1
publisher = re.search(r"\d+\s?-\s?(.*)", result.css(".gs_a").xpath("normalize-space()").get()).group(1)
cited_by = int(re.search(r"\d+", result.css(".gs_or_btn.gs_nph+ a::text").get()).group(0))
data.append({
"title": title,
"snippet": snippet,
"authors": authors,
"year": year,
"publisher": publisher,
"cited_by": cited_by
})
print(json.dumps(data, indent=2, ensure_ascii=False))
Output:
[
{
"title": "A new biology for a new century",
"snippet": "… A society that permits biology to become an engineering discipline, that allows that science … science of biology that helps us to do this, shows the way. An engineering biology might still …",
"authors": "CR Woese",
"year": "2004",
"publisher": "Am Soc Microbiol",
"cited_by": 743
}, ... other results
{
"title": "Campbell biology",
"snippet": "… Now, Campbell series Biology texts are institutionalized. This is the standard biology text across colleges in the US To say the authors and editors know what they are doing at this point …",
"authors": "JB Reece, LA Urry, ML Cain, SA Wasserman…",
"year": "2014",
"publisher": "fvsuol4ed.org",
"cited_by": 1184
}
]
Note: in the example above, I'm using parsel library which is very similar to beautifulsoup and selenium in terms of data extraction.
Alternatively, you can achieve the same thing by using Google Scholar Organic Results API from SerpApi. It's a paid API with a free plan.
The difference is that you don't have to create the parser from scratch, maintain it, figure out how to scale it without getting blocked.
Example code to integrate:
from serpapi import GoogleSearch
import os, json
params = {
"api_key": os.getenv("API_KEY"), # SerpApi API key
"engine": "google_scholar", # parsing engine
"q": "biology", # search query
"hl": "en" # language
}
search = GoogleSearch(params) # where data extraction happens
results = search.get_dict() # JSON -> Python dictionary
for result in results["organic_results"]:
print(json.dumps(result, indent=2))
Output:
{
"position": 0,
"title": "A new biology for a new century",
"result_id": "KNJ0p4CbwgoJ",
"link": "https://journals.asm.org/doi/abs/10.1128/MMBR.68.2.173-186.2004",
"snippet": "\u2026 A society that permits biology to become an engineering discipline, that allows that science \u2026 science of biology that helps us to do this, shows the way. An engineering biology might still \u2026",
"publication_info": {
"summary": "CR Woese - Microbiology and molecular biology reviews, 2004 - Am Soc Microbiol"
},
"resources": [
{
"title": "nih.gov",
"file_format": "HTML",
"link": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC419918/"
},
{
"title": "View it # CTU",
"link": "https://scholar.google.com/scholar?output=instlink&q=info:KNJ0p4CbwgoJ:scholar.google.com/&hl=en&as_sdt=0,11&scillfp=15047057806408271473&oi=lle"
}
],
"inline_links": {
"serpapi_cite_link": "https://serpapi.com/search.json?engine=google_scholar_cite&q=KNJ0p4CbwgoJ",
"html_version": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC419918/",
"cited_by": {
"total": 743,
"link": "https://scholar.google.com/scholar?cites=775353062728716840&as_sdt=80005&sciodt=0,11&hl=en",
"cites_id": "775353062728716840",
"serpapi_scholar_link": "https://serpapi.com/search.json?as_sdt=80005&cites=775353062728716840&engine=google_scholar&hl=en"
},
"related_pages_link": "https://scholar.google.com/scholar?q=related:KNJ0p4CbwgoJ:scholar.google.com/&scioq=biology&hl=en&as_sdt=0,11",
"versions": {
"total": 20,
"link": "https://scholar.google.com/scholar?cluster=775353062728716840&hl=en&as_sdt=0,11",
"cluster_id": "775353062728716840",
"serpapi_scholar_link": "https://serpapi.com/search.json?as_sdt=0%2C11&cluster=775353062728716840&engine=google_scholar&hl=en"
}
}
}
{
"position": 9,
"title": "Campbell biology",
"result_id": "YnWp49O_RTMJ",
"type": "Book",
"link": "http://www.fvsuol4ed.org/reviews/Biology%20Organismal%20Template_Campbell%20Biology_Moran.pdf",
"snippet": "\u2026 Now, Campbell series Biology texts are institutionalized. This is the standard biology text across colleges in the US To say the authors and editors know what they are doing at this point \u2026",
"publication_info": {
"summary": "JB Reece, LA Urry, ML Cain, SA Wasserman\u2026 - 2014 - fvsuol4ed.org"
},
"resources": [
{
"title": "fvsuol4ed.org",
"file_format": "PDF",
"link": "http://www.fvsuol4ed.org/reviews/Biology%20Organismal%20Template_Campbell%20Biology_Moran.pdf"
}
],
"inline_links": {
"serpapi_cite_link": "https://serpapi.com/search.json?engine=google_scholar_cite&q=YnWp49O_RTMJ",
"cited_by": {
"total": 1184,
"link": "https://scholar.google.com/scholar?cites=3694569986105898338&as_sdt=80005&sciodt=0,11&hl=en",
"cites_id": "3694569986105898338",
"serpapi_scholar_link": "https://serpapi.com/search.json?as_sdt=80005&cites=3694569986105898338&engine=google_scholar&hl=en"
},
"related_pages_link": "https://scholar.google.com/scholar?q=related:YnWp49O_RTMJ:scholar.google.com/&scioq=biology&hl=en&as_sdt=0,11",
"versions": {
"total": 33,
"link": "https://scholar.google.com/scholar?cluster=3694569986105898338&hl=en&as_sdt=0,11",
"cluster_id": "3694569986105898338",
"serpapi_scholar_link": "https://serpapi.com/search.json?as_sdt=0%2C11&cluster=3694569986105898338&engine=google_scholar&hl=en"
},
"cached_page_link": "http://scholar.googleusercontent.com/scholar?q=cache:YnWp49O_RTMJ:scholar.google.com/+biology&hl=en&as_sdt=0,11"
}
}
If you need to parse data from all Google Scholar Organic results, there's a dedicated Scrape historic 2017-2021 Organic, Cite Google Scholar results to CSV, SQLite blog post of mine at SerpApi that shows how to do it with API.
Disclaimer, I work for SerpApi.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python scrapy yield to .json file not working - python

Related

How to write a JSONL file in Python

Can't stratify output based on different headings and their corresponding paragraphs

for loop keeps repeating when trying to scrape API to next page

How to Return Only Selected API Parameters in Python

Get a list of elements into separated arrays

Categories

Resources