Python Scrapy is not getting all HTML elements from a webpage

I am trying to use Scrapy to get the names of all current WWE superstars from the following url: http://www.wwe.com/superstars
However, when I run my scraper, it does not return any names. I believe (from attempting the problem with other modules) that the problem is that Scrapy is not finding all of the HTML elements on the page. I tried the same thing with requests and Beautiful Soup, and when I looked at the HTML that requests got back, it was missing important parts of the HTML that I was seeing in my browser's inspector. The HTML containing the names looks like this:
<div class="superstars--info"> == $0
<span class="superstars--name">name here</span>
</div>
My code is posted below. Is there something that I am doing wrong that is causing this not to work?
import scrapy


class SuperstarSpider(scrapy.Spider):
    name = "star_spider"
    start_urls = ["http://www.wwe.com/superstars"]

    def parse(self, response):
        star_selector = '.superstars--info'
        for star in response.css(star_selector):
            NAME_SELECTOR = 'span ::text'
            yield {
                'name': star.css(NAME_SELECTOR).extract_first(),
            }

Sounds like the site has dynamic content which may be loaded using JavaScript and/or XHR calls. Look into Splash: it's a JavaScript rendering engine that behaves a lot like PhantomJS. If you know how to use Docker, Splash is super simple to set up. Once you have Splash running, you integrate it with Scrapy using the scrapy-splash plugin.
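For reference, here is a minimal sketch of what the Splash route could look like, assuming a Splash instance is running locally at http://localhost:8050 and the scrapy-splash package is installed (the settings follow the scrapy-splash README; the selectors are the ones from the question):
import scrapy
from scrapy_splash import SplashRequest


class SuperstarSplashSpider(scrapy.Spider):
    name = "star_splash_spider"

    # Standard scrapy-splash wiring, assuming Splash listens on localhost:8050.
    custom_settings = {
        "SPLASH_URL": "http://localhost:8050",
        "DOWNLOADER_MIDDLEWARES": {
            "scrapy_splash.SplashCookiesMiddleware": 723,
            "scrapy_splash.SplashMiddleware": 725,
            "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
        },
        "SPIDER_MIDDLEWARES": {
            "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
        },
        "DUPEFILTER_CLASS": "scrapy_splash.SplashAwareDupeFilter",
    }

    def start_requests(self):
        # Render the page in Splash before parsing; wait a bit for the JS to run.
        yield SplashRequest(
            "http://www.wwe.com/superstars", self.parse, args={"wait": 2}
        )

    def parse(self, response):
        # Same selectors as the original spider, now run against the rendered HTML.
        for star in response.css(".superstars--info"):
            yield {"name": star.css("span ::text").extract_first()}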

Since the content is JavaScript-generated, you have two options: use something like Selenium to mimic a browser and parse the HTML content, or, if you can, query an API directly.
In this case, this simple solution works:
import requests
import json

URL = "http://www.wwe.com/api/superstars"

with requests.session() as s:
    s.headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0'}
    resp = s.get(URL).json()
    for x in resp['talent'][:10]:
        print(x['name'])
Output (first 10 records):
Abdullah the Butcher
Adam Bomb
Adam Cole
Adam Rose
Aiden English
AJ Lee
AJ Styles
Akam
Akeem
Akira Tozawa
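Since the question uses Scrapy, the same JSON endpoint can also be queried from a spider. A minimal sketch, assuming the /api/superstars endpoint and the 'talent'/'name' keys shown above stay the same:
import json

import scrapy


class SuperstarApiSpider(scrapy.Spider):
    name = "star_api_spider"
    # Same JSON endpoint as the requests example above.
    start_urls = ["http://www.wwe.com/api/superstars"]

    def parse(self, response):
        data = json.loads(response.body)
        for talent in data["talent"]:
            yield {"name": talent["name"]}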

Related

I can't spot some of the elements in the site's source code

I was trying to scrape this website to get the player data.
https://mystics.wnba.com/roster/
I viewed the code using 'Inspect' but the main table isn't in the source code. For example, this is the code for the first player's name:
<div class="content-table__player-name">
<a ng-href="https://www.wnba.com/player/ariel-atkins/" target="_self" href="https://www.wnba.com/player/ariel-atkins/">Ariel Atkins</a>
</div>
I can't find this piece of code (or any code for the player data) in the page source. I searched for most of the table's divs in the source code but I couldn't find any of them.
The content is generated on the fly using some JavaScript. To get the data you want, your program needs to be able to run and interpret JavaScript. You can use tools like Selenium or the headless mode of Chrome to extract the DOM from a running browser.
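For example, a rough Selenium sketch along those lines (it assumes chromedriver is available on PATH and that the content-table__player-name class from the question is still present in the rendered DOM):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
try:
    driver.get("https://mystics.wnba.com/roster/")
    # Wait until the JavaScript-rendered roster entries appear in the DOM.
    names = WebDriverWait(driver, 15).until(
        EC.presence_of_all_elements_located(
            (By.CSS_SELECTOR, ".content-table__player-name a")
        )
    )
    for name in names:
        print(name.text)
finally:
    driver.quit()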
In Firefox you can press F12 to inspect the DOM that was generated by the JavaScript code and locate the desired entries there. You can also inspect the Network tab, which shows the requests the site sends to the server; you might be able to identify the requests that return your desired results.
Since the question is tagged scrapy, here is a solution using Scrapy.
import scrapy
import json


class Test(scrapy.Spider):
    name = 'test'
    start_urls = ['https://data.wnba.com/data/5s/v2015/json/mobile_teams/wnba/2021/teams/mystics_roster.json']

    def parse(self, response):
        data = json.loads(response.body)
        data = data.get('t').get('pl')
        for player in data:
            print(player.get('fn'), player.get('ln'))
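If you save this spider in a single file, Scrapy's runspider command (scrapy runspider yourfile.py, where yourfile.py is whatever you name it) is enough to run it without creating a full project; yielding dicts instead of printing would also let Scrapy's feed exports write the names to a file.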
The following is how you can access the content using the requests module.
import requests

link = 'https://data.wnba.com/data/5s/v2015/json/mobile_teams/wnba/2021/teams/mystics_roster.json'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'
    res = s.get(link)
    for item in res.json()['t']['pl']:
        print(item['fn'], item['ln'])
Output:
Leilani Mitchell
Shavonte Zellous
Tina Charles
Elena Delle Donne
Theresa Plaisance
Natasha Cloud
Shatori Walker-Kimbrough
Sydney Wiese
Erica McCall
Ariel Atkins
Myisha Hines-Allen
Megan Gustafson

How to scrape Chegg Textbook Solution pages using python?

Long story short, I was re-visiting exercises in an old VBA textbook to do some practice (specifically VBA for Modelers - 5th Edition, S. Christian Albright).
I wanted to retrieve the answers for the exercises, so I came to Chegg and thought I could try to scrape the code blocks in the solution pages (example hyperlinked below).
Sample Chegg Textbook Solution Page - code block and HTML in red rectangles
I've been trying to get more acquainted with python and thought this would be a good project to learn more about web scraping.
Below is the code I began with, as I realized that it would not be as simple as scraping the HTML from each solution page. I initially just wanted to find all div elements on the page itself before going further and looping through each exercise page to scrape the code blocks.
#!/usr/bin/python3
# scrapeChegg.py - Scrapes all answer code blocks from each problem exercise
# in each chapter of a textbook (VBA For Modelers - 5th Edition)

import bs4, os, requests

# Starting URL point
url = 'https://www.chegg.com/homework-help/open-new-workbook-get-vbe-insert-module-enter-following-code-chapter-5-problem-1e-solution-9781285869612-exc'

# Retrieve sol'n HTML
head = {'User Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:92.0) Gecko/20100101 Firefox/92.0'}
res = requests.get(url, headers=head)

try:
    res.status_code
    cheggSoup = bs4.BeautifulSoup(res.text, 'html.parser')
    print(cheggSoup.find_all('div'))
except Exception as exc:
    print('Issue occurred: %s' % (exc))
Within one of the div results, the output was as follows:
<p>
Access to this page has been denied because we believe you are using automation tools to browse the
website.
</p>
<p>
This may happen as a result of the following:
</p>
<ul>
<li>
Javascript is disabled or blocked by an extension (ad blockers for example)
</li>
<li>
Your browser does not support cookies
</li>
</ul>
<p>
Please make sure that Javascript and cookies are enabled on your browser and that you are not blocking
them from loading.
</p>
<p>
Reference ID: #5ca2ea20-0052-11ec-8c04-7749576e4445
</p>
</div>
So based on the above, I can see that the page is stopping me from using automation tools. I've looked at similar issues that people have brought up concerning scraping from Chegg, and a lot of solutions are beyond my current knowledge (i.e. various solutions had more key/value pairs within the head dict that I was not sure how to interpret).
Essentially, my question is how I can gain more knowledge (or what resources I should look deeper into, e.g. HTTP, scraping with Python, etc.) to make this project work, if it is possible at all. If anyone has made something like this work before, I would appreciate any advice on what to look at or on how I can make this specific project successful. Thanks!
Try adding the - (hyphen) to the User Agent HTTP header, i.e. use User-Agent:
import requests
from bs4 import BeautifulSoup

url = "https://www.chegg.com/homework-help/open-new-workbook-get-vbe-insert-module-enter-following-code-chapter-5-problem-1e-solution-9781285869612-exc"

# Retrieve sol'n HTML
head = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:92.0) Gecko/20100101 Firefox/92.0",
}
res = requests.get(url, headers=head)
soup = BeautifulSoup(res.content, "html.parser")
print(soup.h1.text)
Prints:
VBA for Modelers (5th Edition) Edit editionThis problem has been solved:Solutions for Chapter 5Problem 1E: Open a new workbook, get into the VBE, insert a module, and enter the following code:Sub Variables() Dim nPounds As Integer, dayOfWeek As Integer nPounds = 17.5 dayOfWeek = “Monday” MsgBox nPounds & “ pounds were ordered on ” & dayOfWeekEnd SubThere are two problems here. One causes the program to fail, and the other causes an incorrect result. Explain what they are and then fix them.…
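If the block page still comes back intermittently, the extra key/value pairs people add to the head dict are usually just the other headers a real browser sends. A hedged sketch (these values are typical examples, not anything Chegg-specific, and they are not guaranteed to get past the bot detection):
head = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:92.0) Gecko/20100101 Firefox/92.0",
    # Headers a normal browser would also send; illustrative values only.
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Referer": "https://www.chegg.com/",
}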

Python web scraping with bs4 on Patreon

I've written a script that looks up a few blogs and sees if a new post has been added. However, when I try to do this on Patreon I cannot find the right element with bs4.
Let's take https://www.patreon.com/cubecoders for example.
Say I want to get the number of exclusive posts under the 'Become a patron to' section, which would be 25 as of now.
This code works just fine:
import requests
from bs4 import BeautifulSoup
plain_html = requests.get("https://www.patreon.com/cubecoders").text
full_html = BeautifulSoup(plain_html, "html.parser")
text_of_newest_post = full_html.find("div", class_="sc-AxjAm fXpRSH").text
print(text_of_newest_post)
Output: 25
Now, I want to get the title of the newest post, which would be 'New in AMP 2.0.2 - Integrated SCP/SFTP server!' as of now.
I inspect the title in my browser and see that it is contained by a span tag with the class 'sc-1di2uql-1 vYcWR'.
However, when I try to run this code I cannot fetch the element:
import requests
from bs4 import BeautifulSoup
plain_html = requests.get("https://www.patreon.com/cubecoders").text
full_html = BeautifulSoup(plain_html, "html.parser")
text_of_newest_post = full_html.find("span", class_="sc-1di2uql-1 vYcWR")
print(text_of_newest_post)
Output: None
I've already tried to fetch the element with XPath or CSS selector but couldn't do it. I thought it might be because the site is rendered first with JavaScript and thus I cannot access the elements before they are rendered correctly.
When I use Selenium to render the site first I can see the title when printing out all div tags on the page but when I want to get only the very first title I can't access it.
Do you guys know a workaround maybe?
Thanks in advance!
EDIT:
In Selenium I can do this:
from selenium import webdriver

browser = webdriver.Chrome("C:\webdrivers\chromedriver.exe")
browser.get("https://www.patreon.com/cubecoders")
divs = browser.find_elements_by_tag_name("div")

def find_text(divs):
    for div in divs:
        for span in div.find_elements_by_tag_name("span"):
            if span.get_attribute("class") == "sc-1di2uql-1 vYcWR":
                return span.text

print(find_text(divs))
browser.close()
Output: New in AMP 2.0.2 - Integrated SCP/SFTP server!
When I just try to search for the spans with class 'sc-1di2uql-1 vYcWR' from the start, it won't give me the result though. Could it be that the find_elements method does not look deeper inside for nested tags?
The data you see is loaded via Ajax from their API. You can use the requests module to load the data.
For example:
import re
import json
import requests
from bs4 import BeautifulSoup

url = 'https://www.patreon.com/cubecoders'
api_url = 'https://www.patreon.com/api/posts'
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0',
    'Accept-Language': 'en-US,en;q=0.5',
    'Referer': url
}

with requests.session() as s:
    html_text = s.get(url, headers=headers).text
    campaign_id = re.search(r'https://www\.patreon\.com/api/campaigns/(\d+)', html_text).group(1)
    data = s.get(api_url, headers=headers, params={'filter[campaign_id]': campaign_id, 'filter[contains_exclusive_posts]': 'true', 'sort': '-published_at'}).json()

    # uncomment this to print all data:
    # print(json.dumps(data, indent=4))

    # print some information to screen:
    for d in data['data']:
        print('{:<70} {}'.format(d['attributes']['title'], d['attributes']['published_at']))
Prints:
New in AMP 2.0.2 - Integrated SCP/SFTP server! 2020-07-17T13:28:49.000+00:00
AMP Enterprise Pricing Reveal! 2020-07-07T10:02:02.000+00:00
AMP Enterprise Edition Waiting List 2020-07-03T13:25:35.000+00:00
Upcoming changes to the user system 2020-05-29T10:53:43.000+00:00
More video tutorials! What do you want to see? 2020-05-21T12:20:53.000+00:00
Third AMP tutorial - Windows installation! 2020-05-21T12:19:23.000+00:00
Another day, another video tutorial! 2020-05-08T22:56:45.000+00:00
AMP Video Tutorial - Out takes! 2020-05-05T23:01:57.000+00:00
AMP Video Tutorials - Installing AMP on Linux 2020-05-05T23:01:46.000+00:00
What is the AMP Console Assistant (AMPCA), and why does it exist? 2020-05-04T01:14:39.000+00:00
Well that was unexpected... 2020-05-01T11:21:09.000+00:00
New Goal - MariaDB/MySQL Support! 2020-04-22T13:41:51.000+00:00
Testing out AMP Enterprise Features 2020-03-31T18:55:42.000+00:00
Temporary feature unlock for all Patreon backers! 2020-03-11T14:53:31.000+00:00
Preparing for Enterprise 2020-03-11T13:09:40.000+00:00
Aarch64/ARM64 and Raspberry Pi is here! 2020-03-06T19:07:09.000+00:00
Aarch64/ARM64 and Raspberry Pi progress! 2020-02-26T17:53:53.000+00:00
Wallpaper! 2020-02-13T11:04:39.000+00:00
Instance Templating - Make once, deploy many. 2020-02-06T15:26:09.000+00:00
Time for a new module! 2020-01-07T13:41:17.000+00:00
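Since the original goal was just the newest post's title, and the request above sorts by -published_at, the first entry should be the latest post. Continuing from the data variable in the snippet above:
# First element is the newest post because of the '-published_at' sort.
newest = data['data'][0]['attributes']
print(newest['title'])          # e.g. New in AMP 2.0.2 - Integrated SCP/SFTP server!
print(newest['published_at'])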

Identifying issue in retrieving href from Google Scholar

Having trouble scraping links and article names from google scholar. I'm unsure if the issue is with my code or the xpath that I'm using to retrieve the data – or possibly both?
I've already spent the past few hours trying to debug and consulting other Stack Overflow questions, but with no success.
import scrapy
from scrapyproj.items import ScrapyProjItem

class scholarScrape(scrapy.Spider):
    name = "scholarScraper"
    allowed_domains = "scholar.google.com"
    start_urls = ["https://scholar.google.com/scholar?hl=en&oe=ASCII&as_sdt=0%2C44&q=rare+disease+discovery&btnG="]

    def parse(self, response):
        item = ScrapyProjItem()
        item['hyperlink'] = item.xpath("//h3[class=gs_rt]/a/#href").extract()
        item['name'] = item.xpath("//div[#class='gs_rt']/h3").extract()
        yield item
The error messages I have been receiving say: "AttributeError: xpath" so I believe that the issue lies with the path that I'm using to try and retrieve the data, but I could also be mistaken?
Adding my comment as an answer, as it solved the problem:
The issue is with the scrapyproj.items.ScrapyProjItem objects: they do not have an xpath method. Is this an official Scrapy class? I think you meant to call xpath on response:
item['hyperlink'] = response.xpath("//h3[class=gs_rt]/a/#href").extract()
item['name'] = response.xpath("//div[#class='gs_rt']/h3").extract()
Also, XPath uses @ rather than # for attribute access, and the first path expression needs a set of quotes around the attribute value "gs_rt":
item['hyperlink'] = response.xpath("//h3[@class='gs_rt']/a/@href").extract()
item['name'] = response.xpath("//div[@class='gs_rt']/h3").extract()
Apart from that, the XPath expressions are fine.
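Putting those fixes together, a rough sketch of the corrected spider might look like this (it keeps the original start URL and selectors, assumes ScrapyProjItem defines hyperlink and name fields, and makes allowed_domains a list, which is what Scrapy expects):
import scrapy
from scrapyproj.items import ScrapyProjItem


class ScholarSpider(scrapy.Spider):
    name = "scholarScraper"
    allowed_domains = ["scholar.google.com"]
    start_urls = ["https://scholar.google.com/scholar?hl=en&oe=ASCII&as_sdt=0%2C44&q=rare+disease+discovery&btnG="]

    def parse(self, response):
        item = ScrapyProjItem()
        # Call xpath on the response, quote attribute values, and use @ for attributes.
        item["hyperlink"] = response.xpath("//h3[@class='gs_rt']/a/@href").extract()
        item["name"] = response.xpath("//div[@class='gs_rt']/h3").extract()
        yield item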
Alternative solution using bs4:
from bs4 import BeautifulSoup
import requests, lxml, os

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

html = requests.get('https://scholar.google.com/citations?hl=en&user=m8dFEawAAAAJ', headers=headers).text
soup = BeautifulSoup(html, 'lxml')

# Container where all articles are located
for article_info in soup.select('#gsc_a_b .gsc_a_t'):
    # title CSS selector
    title = article_info.select_one('.gsc_a_at').text
    # Same title CSS selector, except we're trying to get the "data-href" attribute
    # Note: it will be a relative link, so you need to join it with the absolute link after extracting.
    title_link = article_info.select_one('.gsc_a_at')['data-href']
    print(f'Title: {title}\nTitle link: https://scholar.google.com{title_link}\n')
# Part of the output:
'''
Title: Automating Gödel's Ontological Proof of God's Existence with Higher-order Automated Theorem Provers.
Title link: https://scholar.google.com/citations?view_op=view_citation&hl=en&user=m8dFEawAAAAJ&citation_for_view=m8dFEawAAAAJ:-f6ydRqryjwC
'''
Alternatively, you can do the same with the Google Scholar Author Articles API from SerpApi.
The main difference is that you don't have to think about finding good proxies or solving CAPTCHAs, even if you're using Selenium. It's a paid API with a free plan.
Code to integrate:
from serpapi import GoogleSearch
import os

params = {
    "api_key": os.getenv("API_KEY"),
    "engine": "google_scholar_author",
    "author_id": "9PepYk8AAAAJ",
}

search = GoogleSearch(params)
results = search.get_dict()

for article in results['articles']:
    article_title = article['title']
    article_link = article['link']
    print(f"Title: {article_title}\nLink: {article_link}\n")

# Part of the output:
'''
Title: p-GaN gate HEMTs with tungsten gate metal for high threshold voltage and low gate current
Link: https://scholar.google.com/citations?view_op=view_citation&hl=en&user=9PepYk8AAAAJ&citation_for_view=9PepYk8AAAAJ:bUkhZ_yRbTwC
'''
Disclaimer: I work for SerpApi.

Scraping with drop down menu + button using Python

I'm trying to scrape data from Mexico's Central Bank website but have hit a wall. In terms of actions, I need to first access a link within an initial URL. Once the link has been accessed, I need to select two dropdown values and then activate a submit button. If all goes well, I will be taken to a new URL where a set of links to PDFs is available.
The original url is:
"http://www.banxico.org.mx/mercados/valores-gubernamentales-secto.html"
The nested URL (the one with the dropdowns) is:
"http://www.banxico.org.mx/valores/LeePeriodoSectorizacionValores.faces?BMXC_claseIns=GUB&BMXC_lang=es_MX"
The inputs (arbitrary) are, say: '07/03/2019' and '14/03/2019'.
Using BeautifulSoup and requests, I feel like I got as far as filling in the values in the dropdowns, but I failed to click the button and reach the final URL with the list of links.
My code follows below:
from bs4 import BeautifulSoup
import requests
pagem=requests.get("http://www.banxico.org.mx/mercados/valores-gubernamentales-secto.html")
soupm = BeautifulSoup(pagem.content,"lxml")
lst=soupm.find_all('a', href=True)
url=lst[-1]['href']
page = requests.get(url)
soup = BeautifulSoup(page.content,"lxml")
xin= soup.find("select",{"id":"_id0:selectOneFechaIni"})
xfn= soup.find("select",{"id":"_id0:selectOneFechaFin"})
ino=list(xin.stripped_strings)
fino=list(xfn.stripped_strings)
headers = {'Referer': url}
data = {'_id0:selectOneFechaIni':'07/03/2019', '_id0:selectOneFechaFin':'14/03/2019',"_id0:accion":"_id0:accion"}
respo=requests.post(url,data,headers=headers)
print(respo.url)
In the code, respo.url is equal to url... the code fails. Can anybody please help me identify where the problem is? I'm a newbie to scraping, so the answer might be obvious - apologies in advance for that. I'd appreciate any help. Thanks!
Last time I checked, you cannot submit a form by clicking buttons with BeautifulSoup and Python. There are typically two approaches:
Reverse engineer the form
If the form makes AJAX calls (e.g. makes a request behind the scenes, common for SPAs written in React or Angular), then the best approach is to use the network requests tab in Chrome or another browser to understand what the endpoint is and what the payload is. Once you have those answers, you can make a POST request with the requests library to that endpoint with data=your_payload_dictionary (e.g. manually do what the form is doing behind the scenes). Read this post for a more elaborate tutorial.
Use a headless browser
If the website is written in something like ASP.NET or a similar MVC framework, then the best approach is to use a headless browser to fill out a form and click submit. A popular framework for this is Selenium. This simulates a normal browser. Read this post for a more elaborate tutorial.
Judging by a cursory look at the page you're working on, I recommend approach #2.
The page you have to scrape is:
http://www.banxico.org.mx/valores/PresentaDetalleSectorizacionGubHist.faces
Add the date to consult and the JSESSIONID from the cookies to the payload, and put Referer, User-Agent and all the other usual headers in the request headers.
Example:
import requests
import pandas as pd

cl = requests.session()
url = "http://www.banxico.org.mx/valores/PresentaDetalleSectorizacionGubHist.faces"

payload = {
    "JSESSIONID": "cWQD8qxoNJy_fecAaN2k8N0IQ6bkQ7f3AtzPx4bWL6wcAmO0T809!-1120047000",
    "fechaAConsultar": "21/03/2019"
}

headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36",
    "Content-Type": "application/x-www-form-urlencoded",
    "Referer": "http://www.banxico.org.mx/valores/LeePeriodoSectorizacionValores.faces;jsessionid=cWQD8qxoNJy_fecAaN2k8N0IQ6bkQ7f3AtzPx4bWL6wcAmO0T809!-1120047000"
}

response = cl.post(url, data=payload, headers=headers)
tables = pd.read_html(response.text)
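If the request succeeds, pd.read_html returns a list of DataFrames, so a quick sanity check (assuming at least one table was parsed out of the response) could be:
# Continuing from the snippet above: confirm the sectorization table came back.
print(len(tables))
print(tables[0].head())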
When just clicking through the pages it looks like there's some sort of cookie/session stuff going on that might be difficult to take into account when using requests.
(Example: http://www.banxico.org.mx/valores/LeePeriodoSectorizacionValores.faces;jsessionid=8AkD5D0IDxiiwQzX6KqkB2WIYRjIQb2TIERO1lbP35ClUgzmBNkc!-1120047000)
It might be easier to code this up using Selenium, since that will automate the browser (and take care of all the headers and whatnot). You'll still have access to the HTML to scrape what you need, and you can probably reuse a lot of what you're already doing.
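If you do go the Selenium route, a rough sketch might look like the following. The select ids (_id0:selectOneFechaIni, _id0:selectOneFechaFin) and the control name _id0:accion are taken from the form fields already identified in the question, but the exact option labels and the way the result links are exposed may need adjusting against the live page:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select, WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

URL = ("http://www.banxico.org.mx/valores/LeePeriodoSectorizacionValores.faces"
       "?BMXC_claseIns=GUB&BMXC_lang=es_MX")

driver = webdriver.Chrome()
try:
    driver.get(URL)
    # Pick the start and end dates in the two dropdowns identified above.
    Select(driver.find_element(By.ID, "_id0:selectOneFechaIni")).select_by_visible_text("07/03/2019")
    Select(driver.find_element(By.ID, "_id0:selectOneFechaFin")).select_by_visible_text("14/03/2019")
    # Submit the form; the submit control's name is assumed to be _id0:accion.
    driver.find_element(By.NAME, "_id0:accion").click()
    # Wait for the results page, then collect any links that point at PDFs.
    WebDriverWait(driver, 15).until(
        EC.presence_of_all_elements_located((By.TAG_NAME, "a"))
    )
    for a in driver.find_elements(By.TAG_NAME, "a"):
        href = a.get_attribute("href")
        if href and href.endswith(".pdf"):
            print(href)
finally:
    driver.quit()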
