I am trying to parse a webpage with bs4 but the elements I am trying to access all have different class names.
Example: class='list-item listing … id-12984' and class='list-item listing … id-10359'
def preownedaston(url):
    preownedaston_resp = requests.get(url)
    if preownedaston_resp.status_code == 200:
        bs = BeautifulSoup(preownedaston_resp.text, 'lxml')
        posts = bs.find_all('div', class_='')  # don't know what to put here
        for p in posts:
            title_year = p.find('div', class_='inset').find('a').find('span', class_='model_year').text
            print(title_year)

preownedaston('https://preowned.astonmartin.com/preowned-cars/search/?finance%5B%5D=price&price-currency%5B%5D=EUR&custom-model%5B404%5D%5B%5D=809&continent-country%5B%5D=France&postcode-area=United%20Kingdom&distance%5B%5D=0&transmission%5B%5D=Manual&budget-program%5B%5D=pay&section%5B%5D=109&order=-usd_price&pageId=3760')
Is there a way to parse a partial class name like class_='list-item '?
The CSS selector for matching a partial value of an attribute is as follows:
div[class*='list-item']  # the * means: match if the class attribute contains this substring
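In BeautifulSoup you can apply that selector via select(). Also, because class is a multi-valued attribute, find_all() with class_='list-item' already matches elements whose class list contains 'list-item' among other classes. A minimal sketch (bs is the BeautifulSoup object from your function):
# class_ matches against any single class of the element, so this finds
# class="list-item listing ... id-12984" directly
posts = bs.find_all('div', class_='list-item')

# equivalent CSS-selector route, using the substring match shown above
posts = bs.select("div[class*='list-item']")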
But if you look at the source code of the page, you will see that the content you are trying to scrape is generated by JavaScript, so you have three options here:
Use Selenium with a headless browser to render the JavaScript.
Look for the Ajax calls and try to simulate them; the Ajax URL the website uses to retrieve the data is shown in the next answer.
Look for the data you are trying to scrape inside a script tag, as follows. I prefer this one in similar situations because you end up parsing JSON:
import requests, json
from bs4 import BeautifulSoup

URL = 'https://preowned.astonmartin.com/preowned-cars/search/?finance%5B%5D=price&price-currency%5B%5D=EUR&custom-model%5B404%5D%5B%5D=809&continent-country%5B%5D=France&postcode-area=United%20Kingdom&distance%5B%5D=0&transmission%5B%5D=Manual&budget-program%5B%5D=pay&section%5B%5D=109&order=-usd_price&pageId=3760'

page = requests.get(URL, headers={"User-Agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"})
soup = BeautifulSoup(page.text, 'html.parser')

# The structured data sits in a <script type="application/ld+json"> tag:
json_obj = soup.find('script', {'type': "application/ld+json"}).text
# {"@context":"http://schema.org","@graph":[{"@type":"Brand","name":""},{"@type":"OfferCatalog","itemListElement":[{"@type":"Offer","name":"Pre-Owned By Aston Martin","price":"€114,900.00","url":"https://preowned.astonmartin.com/preowned-cars/12984-aston-martin-v12-vantage-v8-volante/","itemOffered":{"@type":"Car","name":"Aston Martin V12 Vantage V8 Volante","brand":"Aston Martin","model":"V12 Vantage","itemCondition":"Used","category":"Used","productionDate":"2010","releaseDate":"2011","bodyType":"6.0 Litre V12","emissionsCO2":"388","fuelType":"Obsidian Black","mileageFromOdometer":"42000","modelDate":"2011","seatingCapacity":"2","speed":"190","vehicleEngine":"6l","vehicleInteriorColor":"Obsidian Black","color":"Black"}},{"@type":"Offer","name":"Pre-Owned By Aston Martin","price":"€99,900.00","url":"https://preowned.astonmartin.com/preowned-cars/10359-aston-martin-v12-vantage-carbon-edition-coupe/","itemOffered":{"@type":"Car","name":"Aston Martin V12 Vantage Carbon Edition Coupe","brand":"Aston Martin","model":"V12 Vantage","itemCondition":"Used","category":"Used","productionDate":"2011","releaseDate":"2011","bodyType":"6.0 Litre V12","emissionsCO2":"388","fuelType":"Obsidian Black","mileageFromOdometer":"42000","modelDate":"2011","seatingCapacity":"2","speed":"190","vehicleEngine":"6l","vehicleInteriorColor":"Obsidian Black","color":"Black"}}]},{"@type":"BreadcrumbList","itemListElement":[{"@type":"ListItem","position":"1","item":{"@id":"https://preowned.astonmartin.com/","name":"Homepage"}},{"@type":"ListItem","position":"2","item":{"@id":"https://preowned.astonmartin.com/preowned-cars/","name":"Pre-Owned Cars"}},{"@type":"ListItem","position":"3","item":{"@id":"//preowned.astonmartin.com/preowned-cars/search/","name":"Pre-Owned By Aston Martin"}}]}]}
items = json.loads(json_obj)['@graph'][1]['itemListElement']
for item in items:
    print(item['itemOffered']['name'])
Output:
Aston Martin V12 Vantage V8 Volante
Aston Martin V12 Vantage Carbon Edition Coupe
The information from this URL actually comes back in JSON format, which means you can easily extract the fields you want. For example:
import requests

url = "https://preowned.astonmartin.com/ajax/stock-listing/get-items/pageId/3760/ratio/3_2/taxBandImageLink/aHR0cHM6Ly9kMnBwMTFwZ29wNWY2cC5jbG91ZGZyb250Lm5ldC9UYXhCYW5kLSV0YXhfYmFuZCUuanBn/taxBandImageHyperlink/JWRlYWxlcl9lbWFpbCU=/imgWidth/767/?finance%5B%5D=price&price-currency%5B%5D=EUR&custom-model%5B404%5D%5B%5D=809&continent-country%5B%5D=France&distance%5B%5D=0&transmission%5B%5D=Manual&budget-program%5B%5D=pay&section%5B%5D=109&order=-usd_price&pageId=3760"

r = requests.get(url)
data = r.json()

details = ['make', 'mileage', 'model', 'model_year', 'mpg', 'exterior_colour', 'price_now']

for vehicle in data['vehicles']:
    print()
    for key in details:
        print(f"{key:18} : {vehicle[key]}")
This displays the following:
make               : Aston Martin
mileage            : 42,000 km
model              : V12 Vantage
model_year         : 2011
mpg                : 17.3
exterior_colour    : Carbon Black
price_now          : €114,900

make               : Aston Martin
mileage            : 42,000 km
model              : V12 Vantage
model_year         : 2011
mpg                : 17.3
exterior_colour    : Carbon Black
price_now          : €99,900
Note: it might be necessary to add a user agent request header if the data is not returned (see the sketch below). If you print the whole of data you can see all of the available information for each vehicle.
This approach avoids the need for JavaScript processing via Selenium, and also avoids having to parse any HTML with BeautifulSoup. The URL was found using the browser's network tools whilst the page was loading.
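If you do need the header, it is one extra argument to requests.get(); the User-Agent string below is just an example:
# a minimal sketch: the same request with a User-Agent header added
r = requests.get(url, headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"})
data = r.json()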
I was trying to scrape this website to get the player data.
https://mystics.wnba.com/roster/
I viewed the code using 'Inspect' but the main table isn't in the source code. For example, this is the code for the first player's name:
<div class="content-table__player-name">
<a ng-href="https://www.wnba.com/player/ariel-atkins/" target="_self" href="https://www.wnba.com/player/ariel-atkins/">Ariel Atkins</a>
</div>
I can't find this piece of code (or any code for the player data) in the page source. I searched for most of the table's divs in the source code but I couldn't find any of them.
The content is generated on the fly, using some JavaScript. To get the data you want, your program needs to be able to run and interpret JavaScript. You can use tools like Selenium or the headless mode of Chrome to extract the DOM from a running browser.
In Firefox you can press F12 to inspect the DOM that was generated by the JavaScript code, and locate the desired entries there. You can also inspect the Network tab, which shows you the requests the site sends to the server; you might be able to identify the requests that return your desired results.
Since the question is tagged with scrapy, here is a solution using Scrapy.
import scrapy
import json

class Test(scrapy.Spider):
    name = 'test'
    # JSON endpoint that the roster page loads its data from
    start_urls = ['https://data.wnba.com/data/5s/v2015/json/mobile_teams/wnba/2021/teams/mystics_roster.json']

    def parse(self, response):
        data = json.loads(response.body)
        data = data.get('t').get('pl')
        for player in data:
            print(player.get('fn'), player.get('ln'))
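To try it, save the spider to a file (the name test.py here is arbitrary) and run it with:
scrapy runspider test.py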
The following is how you can access the content using the requests module.
import requests

link = 'https://data.wnba.com/data/5s/v2015/json/mobile_teams/wnba/2021/teams/mystics_roster.json'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'
    res = s.get(link)
    for item in res.json()['t']['pl']:
        print(item['fn'], item['ln'])
Output:
Leilani Mitchell
Shavonte Zellous
Tina Charles
Elena Delle Donne
Theresa Plaisance
Natasha Cloud
Shatori Walker-Kimbrough
Sydney Wiese
Erica McCall
Ariel Atkins
Myisha Hines-Allen
Megan Gustafson
Having trouble scraping links and article names from Google Scholar. I'm unsure if the issue is with my code or with the XPath that I'm using to retrieve the data, or possibly both?
I've already spent the past few hours trying to debug and consulting other Stack Overflow questions, but with no success.
import scrapy
from scrapyproj.items import ScrapyProjItem

class scholarScrape(scrapy.Spider):
    name = "scholarScraper"
    allowed_domains = "scholar.google.com"
    start_urls = ["https://scholar.google.com/scholar?hl=en&oe=ASCII&as_sdt=0%2C44&q=rare+disease+discovery&btnG="]

    def parse(self, response):
        item = ScrapyProjItem()
        item['hyperlink'] = item.xpath("//h3[class=gs_rt]/a/@href").extract()
        item['name'] = item.xpath("//div[@class='gs_rt']/h3").extract()
        yield item
The error messages I have been receiving say: "AttributeError: xpath" so I believe that the issue lies with the path that I'm using to try and retrieve the data, but I could also be mistaken?
Adding my comment as an answer, as it solved the problem:
The issue is with scrapyproj.items.ScrapyProjItem objects: they do not have an xpath attribute. Is this an official scrapy class? I think you meant to call xpath on response:
item['hyperlink'] = response.xpath("//h3[class=gs_rt]/a/@href").extract()
item['name'] = response.xpath("//div[@class='gs_rt']/h3").extract()
Also, the first path expression needs the @ prefix on class and a set of quotes around the attribute value "gs_rt":
item['hyperlink'] = response.xpath("//h3[@class='gs_rt']/a/@href").extract()
Apart from that, the XPath expressions are fine.
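Putting both fixes together, the corrected parse method would look like this (everything else in the spider stays the same):
def parse(self, response):
    item = ScrapyProjItem()
    # xpath is called on response; attributes carry the @ prefix and quoted values
    item['hyperlink'] = response.xpath("//h3[@class='gs_rt']/a/@href").extract()
    item['name'] = response.xpath("//div[@class='gs_rt']/h3").extract()
    yield item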
Alternative solution using bs4:
from bs4 import BeautifulSoup
import requests, lxml

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

html = requests.get('https://scholar.google.com/citations?hl=en&user=m8dFEawAAAAJ', headers=headers).text
soup = BeautifulSoup(html, 'lxml')

# Container where all articles are located
for article_info in soup.select('#gsc_a_b .gsc_a_t'):
    # title CSS selector
    title = article_info.select_one('.gsc_a_at').text
    # Same title CSS selector, except we're grabbing the "data-href" attribute.
    # Note: it will be a relative link, so it has to be joined with the absolute link.
    title_link = article_info.select_one('.gsc_a_at')['data-href']
    print(f'Title: {title}\nTitle link: https://scholar.google.com{title_link}\n')
# Part of the output:
'''
Title: Automating Gödel's Ontological Proof of God's Existence with Higher-order Automated Theorem Provers.
Title link: https://scholar.google.com/citations?view_op=view_citation&hl=en&user=m8dFEawAAAAJ&citation_for_view=m8dFEawAAAAJ:-f6ydRqryjwC
'''
Alternatively, you can do the same with Google Scholar Author Articles API from SerpApi.
The main difference is that you don't have to think about finding good proxies or solving CAPTCHAs, which you would run into even with selenium. It's a paid API with a free plan.
Code to integrate:
from serpapi import GoogleSearch
import os

params = {
    "api_key": os.getenv("API_KEY"),
    "engine": "google_scholar_author",
    "author_id": "9PepYk8AAAAJ",
}

search = GoogleSearch(params)
results = search.get_dict()

for article in results['articles']:
    article_title = article['title']
    article_link = article['link']
    # print the fields so the output below is produced
    print(f'Title: {article_title}\nLink: {article_link}\n')
# Part of the output:
'''
Title: p-GaN gate HEMTs with tungsten gate metal for high threshold voltage and low gate current
Link: https://scholar.google.com/citations?view_op=view_citation&hl=en&user=9PepYk8AAAAJ&citation_for_view=9PepYk8AAAAJ:bUkhZ_yRbTwC
'''
Disclaimer, I work for SerpApi.
I need to grab text data from the Google search engine info bar. If someone uses the keyword "siemens" to search on Google, a small info bar appears on the right side of the search results. I want to collect some of the text information from that info bar. How can I do that using requests and BeautifulSoup? Here is some of the code I wrote:
from bs4 import BeautifulSoup as BS
import requests
from googlesearch import search
from googleapiclient.discovery import build

url = 'https://www.google.com/search?ei=j-iKXNDxDMPdwALdwofACg&q='
com = 'siemens'

#for url in search(com, tld='de', lang='de', stop=10):
#    print(url)

response = requests.get(url + com)
soup = BS(response.content, 'html.parser')
(Screenshot omitted: the red-marked area on the results page is the info bar.)
You can use the find function in BeautifulSoup to retrieve all the elements with a given class name, id, or other attribute. If you inspect the info bar (right-click on it and choose 'Inspect'), you can find the unique class name or id for that bar; use that to filter the info bar out of the HTML parsed by BeautifulSoup.
Check out find() and find_all() in BeautifulSoup to achieve your output. Always try finding by id first, since an id is unique within an HTML document. If there is no id, then go for the other options.
To build the URL, use google.com/search?q=[] with your search query inside the brackets. For queries with more than one word, put a '+' in between, as in the sketch below.
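A minimal sketch of those ideas; note that the id 'info-bar' is a hypothetical placeholder, so inspect the page to find the element's real id or class:
from bs4 import BeautifulSoup
import requests

# build the search URL: words joined with '+'
query = 'siemens ag'.replace(' ', '+')
html = requests.get('https://www.google.com/search?q=' + query).text
soup = BeautifulSoup(html, 'html.parser')

# prefer finding by id ('info-bar' is made up for illustration)
info_bar = soup.find(id='info-bar')
if info_bar is not None:
    print(info_bar.get_text())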
Make sure you're using a user-agent header to imitate a real user visit, otherwise Google might block the request. List of user-agents.
To visually pick elements from a page, you can use the SelectorGadget Chrome extension to grab CSS selectors.
Code and example in online IDE:
from bs4 import BeautifulSoup
import requests, lxml

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

response = requests.get('https://www.google.com/search?q=simens', headers=headers).text
soup = BeautifulSoup(response, 'lxml')

title = soup.select_one('.SPZz6b h2').text
subtitle = soup.select_one('.wwUB2c span').text
website = soup.select_one('.ellip .ellip').text
snippet = soup.select_one('.Uo8X3b+ span').text

print(f'{title}\n{subtitle}\n{website}\n{snippet}')
Output:
Siemens
Automation company
siemens.com
Siemens AG is a German multinational conglomerate company headquartered in Munich and the largest industrial manufacturing company in Europe with branch offices abroad.
Alternatively, you can use Google Search Engine Results API from SerpApi. It's a paid API with a free plan.
Code to integrate:
import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "simens",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

title = results["knowledge_graph"]["title"]
subtitle = results["knowledge_graph"]["type"]
website = results["knowledge_graph"]["website"]
snippet = results["knowledge_graph"]["description"]

print(f'{title}\n{subtitle}\n{website}\n{snippet}')
Output:
Siemens
Automation company
http://www.siemens.com/
Siemens AG is a German multinational conglomerate company headquartered in Munich and the largest industrial manufacturing company in Europe with branch offices abroad.
Disclaimer, I work at SerpApi.
I am trying to use Scrapy to get the names of all current WWE superstars from the following url: http://www.wwe.com/superstars
However, when I run my scraper, it does not return any names. I believe (from attempting the problem with other modules) that the problem is that Scrapy is not finding all of the HTML elements on the page. I tried the problem with requests and Beautiful Soup, and when I looked at the HTML that requests got, it was missing important parts of the HTML that I was seeing in my browser's inspector. The HTML containing the names looks like this:
<div class="superstars--info"> == $0
<span class="superstars--name">name here</span>
</div>
My code is posted below. Is there something that I am doing wrong that is causing this not to work?
import scrapy

class SuperstarSpider(scrapy.Spider):
    name = "star_spider"
    start_urls = ["http://www.wwe.com/superstars"]

    def parse(self, response):
        star_selector = '.superstars--info'
        for star in response.css(star_selector):
            NAME_SELECTOR = 'span ::text'
            yield {
                'name': star.css(NAME_SELECTOR).extract_first(),
            }
Sounds like the site has dynamic content which may be loaded using JavaScript and/or XHR calls. Look into Splash: it's a JavaScript render engine that behaves a lot like PhantomJS. If you know how to use Docker, Splash is super simple to set up. Once Splash is running, you integrate it with Scrapy using the scrapy-splash plugin, as sketched below.
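A rough sketch of that integration, assuming a Splash instance is listening locally on port 8050 (e.g. started with docker run -p 8050:8050 scrapinghub/splash); the settings are the standard scrapy-splash boilerplate from its README:
# settings.py
SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# in the spider: request the page through Splash so the JavaScript is rendered
from scrapy_splash import SplashRequest

def start_requests(self):
    yield SplashRequest('http://www.wwe.com/superstars', self.parse, args={'wait': 2})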
Since the content is JavaScript-generated, you have two options: use something like selenium to mimic a browser and parse the HTML content, or, if you can, query an API directly.
In this case, this simple solution works:
import requests

URL = "http://www.wwe.com/api/superstars"

with requests.Session() as s:
    s.headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0'}
    resp = s.get(URL).json()
    for x in resp['talent'][:10]:
        print(x['name'])
Output (first 10 records):
Abdullah the Butcher
Adam Bomb
Adam Cole
Adam Rose
Aiden English
AJ Lee
AJ Styles
Akam
Akeem
Akira Tozawa
I'm a newbie to Python and I'm working on a little Python script that requests and reads the HTML of a URL.
For information, the web page I'm working on is http://bitcoinity.org/markets.
I would like my script to fetch the current price of the market.
I checked the HTML code and found that the price is in a tag:
<span id="last_price" value="447.77"></span>
Here is the code of my Python script :
import urllib2
import urllib
from bs4 import BeautifulSoup

url = "http://bitcoinity.org/markets"
values = {'name': 'Michael Foord',
          'location': 'Northampton',
          'language': 'Python'}
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}

data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
HTML = urllib2.urlopen(req)
soup = BeautifulSoup(HTML)
HTMLText = HTML.read()
HTML.close()

#print soup.prettify()
#print HTMLText
So the problem is that the output of this script (with both methods, BeautifulSoup and read()) is like this:
</span>
<span id="last_price">
</span>
The "value=" attribute is missing and the syntax changed , so I don't know if the server doesn't allow me to make a request of this value or if there is a problem with my code.
All Help is welcome ! :)
( Sorry for my bad english , i'm not a native )
The price is calculated via a set of JavaScript functions, so the urllib2+BeautifulSoup approach would not work in this case.
Consider using a tool that utilizes a real browser, like selenium:
>>> from selenium import webdriver
>>> driver = webdriver.Firefox()
>>> driver.get('http://bitcoinity.org/markets')
>>> driver.find_element_by_id('last_price').text
u'0.448'
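If the call runs before the JavaScript has filled the element in, an explicit wait helps; a small sketch using Selenium's built-in waiting utilities:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for the price element to appear in the DOM
wait = WebDriverWait(driver, 10)
price = wait.until(EC.presence_of_element_located((By.ID, 'last_price'))).text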
I'm not sure beautifulsoup or selenium are the right tools for this task; they're actually a very poor solution here.
Since we're talking about "stock" prices (bitcoin in this case), it is much better to feed your app/script with real-time market data. Bitcoinity's default "current price" is actually Bitstamp's price, and you can get it directly from Bitstamp's API in two ways.
HTTP API
Here's the ticker you need to feed your app with: https://www.bitstamp.net/api/ticker/. Here is how you can get the last price (the 'last' value of that JSON is what you are really looking for):
import urllib2
import json

req = urllib2.Request("https://www.bitstamp.net/api/ticker/")
opener = urllib2.build_opener()
f = opener.open(req)
ticker = json.loads(f.read())  # renamed from 'json' so it doesn't shadow the module
print 'Bitcoin last price is = ' + ticker['last']
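The snippet above is Python 2 (urllib2 and the print statement); on Python 3 the same lookup is a couple of lines with requests, a minimal sketch:
import requests

# fetch the ticker and read the 'last' field (returned as a string)
last = requests.get('https://www.bitstamp.net/api/ticker/').json()['last']
print('Bitcoin last price is = ' + last)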
Websockets API
This is how bitcoinity, bitcoinwisdom, etc. grab the prices and market info in order to show them to you in real time. For this you'll need a Pusher client package for Python, since Bitstamp uses Pusher for websockets; a rough sketch follows.
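A rough sketch using the community pusherclient package (pip install pusherclient). The app key below is Bitstamp's historical public Pusher key from their old API docs, and the channel/event names ('live_trades'/'trade') come from the same docs; treat all three as assumptions to verify before use:
import time
import json
import pusherclient

BITSTAMP_APP_KEY = 'de504dc5763aeef9ff52'  # historical public key, verify before use

def trade_handler(data):
    trade = json.loads(data)
    print('Last trade price:', trade['price'])

def connect_handler(data):
    # subscribe only once the connection is established
    channel = pusher.subscribe('live_trades')
    channel.bind('trade', trade_handler)

pusher = pusherclient.Pusher(BITSTAMP_APP_KEY)
pusher.connection.bind('pusher:connection_established', connect_handler)
pusher.connect()

while True:
    time.sleep(1)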