Parsing just the first result with Beautiful Soup - Python

I have the following code, which successfully pulls links, titles, etc. for podcast episodes. How would I go about pulling just the first one it comes to (i.e. the latest episode) and then stopping immediately, producing just that result? Any advice would be greatly appreciated.
def get_playable_podcast(soup):
    """
    @param soup: parsed html page
    """
    subjects = []
    for content in soup.find_all('item'):
        try:
            link = content.find('enclosure')
            link = link.get('url')
            print "\n\nLink: ", link

            title = content.find('title')
            title = title.get_text()

            desc = content.find('itunes:subtitle')
            desc = desc.get_text()

            thumbnail = content.find('itunes:image')
            thumbnail = thumbnail.get('href')
        except AttributeError:
            continue

        item = {
            'url': link,
            'title': title,
            'desc': desc,
            'thumbnail': thumbnail,
        }
        subjects.append(item)

    return subjects

def compile_playable_podcast(playable_podcast):
    """
    @param playable_podcast: list containing dicts of key/value pairs for playable podcasts
    """
    items = []
    for podcast in playable_podcast:
        items.append({
            'label': podcast['title'],
            'thumbnail': podcast['thumbnail'],
            'path': podcast['url'],
            'info': podcast['desc'],
            'is_playable': True,
        })
    return items

The answer from @John Gordon is completely correct.
@John Gordon pointed out that:
soup.find('item')
will always return the first matching item (which is perfectly fine for you, since you want to scrape the latest episode).
However, imagine you wanted to select the second, third, fourth, etc. item from your BeautifulSoup results. Then you could do that with the following line of code:
soup.find_all('item')[0]  # works the same way as soup.find('item') and returns the first item
When you replace the 0 with any other index (e.g. 3), you get just the chosen (in this example, the fourth) item ;).
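Applied to the question's code, a minimal sketch of that idea could look like this (get_latest_podcast is a hypothetical helper name; it relies only on find('item') returning the first match):

def get_latest_podcast(soup):
    """Return only the most recent episode, i.e. the first <item> in the feed."""
    content = soup.find('item')  # find() returns the first match, or None
    if content is None:
        return None
    try:
        return {
            'url': content.find('enclosure').get('url'),
            'title': content.find('title').get_text(),
            'desc': content.find('itunes:subtitle').get_text(),
            'thumbnail': content.find('itunes:image').get('href'),
        }
    except AttributeError:
        # a malformed first item yields no result in this sketch
        return None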

Related

python - Scrapy, printing results one by one

I am working on crawling Google search results.
Here is my code:
def parse(self, response):
    all_page = response.xpath('//*[@id="main"]')
    for page in all_page:
        title = page.xpath('//*[@id="main"]/div/div/div/a/h3/div/text()').extract()
        link = page.xpath('//*[@id="main"]/div/div/div/a/@href').extract()
        print('title', title)
        print('link', link)
The output is:
title ['iPhone - Compare Models - Apple', 'iPhone - Compare Models - Apple (MY)', ......]
link ['https://www.apple.com/iphone/compare/&sa=U&ved=2ahUKEwiKvsnnmLDxAhWZIDQIHXEdA60QFjAGegQIBRAB&usg=AOvVaw1FCyWoMh1LcbM65W6l8ypN', '/url?q=https://www.apple.com/my/iphone/compare/&sa=U&ved=2ahUKEwiKvsnnmLDxAhWZIDQIHXEdA60QFjAHegQICBAB&usg=AOvVaw3i33ED_sBrbAuNLAJsOlxe', ....]
I want it like this:
title: 'iPhone - Compare Models - Apple'
Link: 'https://www.apple.com/iphone/compare/&sa=U&ved=2ahUKEwiKvsnnmLDxAhWZIDQIHXEdA60QFjAGegQIBRAB&usg=AOvVaw1FCyWoMh1LcbM65W6l8ypN'
title: 'iPhone - Compare Models - Apple (MY)'
Link: 'https://www.apple.com/my/iphone/compare/&sa=U&ved=2ahUKEwiKvsnnmLDxAhWZIDQIHXEdA60QFjAHegQICBAB&usg=AOvVaw3i33ED_sBrbAuNLAJsOlxe'
How can I do that?
Thank you.
The extract() method extracts all items found by the XPath expression, so with page.xpath('//*[@id="main"]/div/div/div/a/h3/div/text()').extract() you get all of the matches for that expression at once.
You can zip them together, as mentioned in another answer, but to be more precise you should start from the parent object:
for page in all_page:
    for element in page.xpath('//*[@id="main"]/div/div/div'):
        title = element.xpath('a/h3/div/text()').extract_first()
        link = element.xpath('a/@href').extract_first()
        print('title', title)
        print('link', link)
Your title and link results are lists with several items each, so you need to loop over them in parallel with zip, assuming you get one link per title.
It looks that way from your example, but make sure that is the case.
def parse(self, response):
    all_page = response.xpath('//*[@id="main"]')
    for page in all_page:
        titles = page.xpath('//*[@id="main"]/div/div/div/a/h3/div/text()').extract()
        links = page.xpath('//*[@id="main"]/div/div/div/a/@href').extract()
        for title, link in zip(titles, links):
            print(f"title: '{title}'\n\n"
                  f"Link: {link}")

How Do I Find Amazon Product Names with requests-html?

I've been trying to code a program in Python that can return a list of all the product names on the first page. I have a function that builds the URL based on what you want to search for:
def get_url(search_term):
    template = 'https://www.amazon.com/s?k={}&ref=nb_sb_noss_1'
    search_term = search_term.replace(' ', '+')
    url = template.format(search_term)
    print(url)
    return url  # was `return URL`, which raises a NameError: Python names are case-sensitive
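For example, calling it with a hypothetical search term builds (and prints) a URL straight from the template:

url = get_url('gaming headset')
# prints: https://www.amazon.com/s?k=gaming+headset&ref=nb_sb_noss_1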
Then I pass the URL into another function, and here is where I need help. Right now my function to retrieve the title and number of reviews is this:
from requests_html import HTMLSession

def getInfo(url):
    r = HTMLSession().get(url)
    r.html.render()
    product = {
        'title': r.html.find('.a-size-medium.a-color-base.a-text-normal', first=True).text,
        'reviews': r.html.find('.a-size-base', first=True).text
    }
    print(product)
However, the r.html.find part isn't getting the info I need; it either returns [] or, if I add first=True, None. I've tried different ways, like using an XPath or another selector, but none of those seemed to work. Can anyone help me find a way to use the html.find method to find all the product names and save them under 'title' in the product dictionary?
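One way to collect every matching element instead of only the first is sketched below. Note that the CSS selector is an assumption about Amazon's current markup (inspect the page and adjust it if the list comes back empty), and rendered requests may still be blocked by Amazon's bot detection:

from requests_html import HTMLSession

def get_titles(url):
    """Collect the text of every element matching the title selector (a sketch)."""
    r = HTMLSession().get(url)
    r.html.render()  # run the page's JavaScript before searching the DOM
    # NOTE: this compound selector is an assumption about Amazon's markup
    elements = r.html.find('span.a-size-medium.a-color-base.a-text-normal')
    return [element.text for element in elements]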

Combining BeautifulSoup and json into one output

I have probably not explained my question well, but as this is new to me... Anyway, I need to combine these two pieces of code.
I can get the BeautifulSoup version working, but it uses the wrong image. To get the right fields and the right image, I have to parse the JSON part of the website, and therefore BeautifulSoup alone won't work.
The JSON parsing is here:
import json
import urllib
import requests
import re

r = urllib.urlopen('https://thisiscriminal.com/wp-json/criminal/v1/episodes?posts=10000&page=1')
data = json.loads(r.read().decode('utf-8'))

for post in data['posts']:
    print post['episodeNumber']
    print post['title']
    print post['audioSource']
    print post['image']['medium']
    print post['content']
And it needs to replace the try/BeautifulSoup part here:
def get_playable_podcast(soup):
    """
    @param soup: parsed html page
    """
    subjects = []
    for content in soup.find_all('item'):
        try:
            link = content.find('enclosure')
            link = link.get('url')
            print "\n\nLink: ", link

            title = content.find('title')
            title = title.get_text()

            desc = content.find('itunes:subtitle')
            desc = desc.get_text()

            thumbnail = content.find('itunes:image')
            thumbnail = thumbnail.get('href')
        except AttributeError:
            continue

        item = {
            'url': link,
            'title': title,
            'desc': desc,
            'thumbnail': thumbnail,
        }
        subjects.append(item)

    return subjects

def compile_playable_podcast(playable_podcast):
    """
    @param playable_podcast: list containing dicts of key/value pairs for playable podcasts
    """
    items = []
    for podcast in playable_podcast:
        items.append({
            'label': podcast['title'],
            'thumbnail': podcast['thumbnail'],
            'path': podcast['url'],
            'info': podcast['desc'],
            'is_playable': True,
        })
    return items
I have tried all sorts of variations of passing the output through to the items section, but this is the most common error I get. I just have no idea how to pass the data from the JSON through.
Error Type: <type 'exceptions.NameError'>
Error Contents: name 'title' is not defined
Traceback (most recent call last):
  File ".../addon.py", line 6, in <module>
    from resources.lib import thisiscriminal
  File "....resources/lib/thisiscriminal.py", line 132, in <module>
    'title': title,
NameError: name 'title' is not defined
Your JSON request should contain all the information you need. Print json_data, take a look at what is returned, and decide which parts you need.
Based on what your other code was looking for, the following code shows how you could extract some of those fields:
import requests

r = requests.get('https://thisiscriminal.com/wp-json/criminal/v1/episodes?posts=10000&page=1')
json_data = r.json()

items = []
for post in json_data['posts']:
    items.append([
        post['title'].encode('utf-8'),
        post['image']['thumb'],
        post['excerpt']['long'],
        post['permalink'],
    ])

for item in items:
    print item
This would give you output starting:
['Stowaway', u'https://thisiscriminal.com/wp-content/uploads/2019/07/Stowaway_art-150x150.png', u'One day in 1969, Paulette Cooper decided to see what she could get away with.', u'https://thisiscriminal.com/episode-118-stowaway-7-5-2019/']
['The Lake', u'https://thisiscriminal.com/wp-content/uploads/2019/06/Lake_art-150x150.png', u'Amanda Hamm and her boyfriend Maurice LaGrone drove to Clinton Lake one night in 2003. The next day, DeWitt County Sheriff Roger Massey told a local newspaper, \u201cWe don\u2019t want to blow this up into something that it\u2019s not. But on the other side, we\u2019ve got three children...', u'https://thisiscriminal.com/episode-117-the-lake-6-21-2019/']
['Jessica and the Bunny Ranch', u'https://thisiscriminal.com/wp-content/uploads/2019/06/Bunny_art-150x150.png', u'In our\xa0last episode\xa0we spoke Cecilia Gentili, a trans Latina who worked for many years as an undocumented sex worker. Today, we get two more views of sex work in America. We speak with a high-end escort in New York City, and take a trip to one of the...', u'https://thisiscriminal.com/episode-116-jessica-and-the-bunny-ranch-6-7-2019/']
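To wire this back into the code from the question, a minimal sketch of a JSON-based get_playable_podcast could look like the following. The dict keys match what compile_playable_podcast expects, and the field mapping (audioSource for the url, image['medium'] for the thumbnail) is taken from the question's own JSON-printing snippet; excerpt['long'] as the description is an assumption based on the answer above:

import requests

def get_playable_podcast():
    """A sketch of a JSON-based replacement for the BeautifulSoup version."""
    url = 'https://thisiscriminal.com/wp-json/criminal/v1/episodes?posts=10000&page=1'
    json_data = requests.get(url).json()

    subjects = []
    for post in json_data['posts']:
        subjects.append({
            'url': post['audioSource'],            # the playable audio link
            'title': post['title'],
            'desc': post['excerpt']['long'],       # used as the description
            'thumbnail': post['image']['medium'],  # the image the asker wants
        })
    return subjects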

python - How would I scrape this website for specific data that's constantly changing/being updated?

The website is:
https://pokemongo.gamepress.gg/best-attackers-type
My code is as follows, for now:
from bs4 import BeautifulSoup
import requests
import re

site = 'https://pokemongo.gamepress.gg/best-attackers-type'
headers = {'User-Agent': 'Mozilla/5.0'}  # assumed value; `headers` was used below but never defined

page_data = requests.get(site, headers=headers)
soup = BeautifulSoup(page_data.text, 'html.parser')
check_gamepress = soup.body.findAll(text=re.compile("Strength"))
print(check_gamepress)
However, I really want to scrape certain data, and I'm having trouble doing it.
For example, how would I scrape the portion that shows the following for the best Bug type:
"Good typing and lightning-fast attacks. Though cool-looking, Scizor is somewhat fragile."
This information could obviously be updated, as it has been in the past, when a better Pokemon comes out for that type. So how would I scrape this data so it keeps working when the page is updated in the future, without me having to change the code when that occurs?
In advance, thank you for reading!
This particular site is a bit tough because of how the HTML is organized. The relevant tags containing the information don't really have many distinguishing features, so we have to get a little clever. To make matters complicated, the divs that contain the information across the whole page are siblings. We'll also have to make up for this web-design travesty with some ingenuity.
I did notice a pattern that is (almost entirely) consistent throughout the page. Each 'type' and its underlying section are broken into 3 divs:
A div containing the type and pokemon, for example Dark Type: Tyranitar.
A div containing the 'specialty' and moves.
A div containing the 'ratings' and commentary.
The basic idea is that we can organize this markup chaos through a procedure that loosely goes like this:
Identify each of the type title divs.
For each of those divs, get the other two divs by accessing its siblings.
Parse the information out of each of those divs.
With this in mind, I produced a working solution. The meat of the code consists of five functions: one to find each section, one to extract the siblings, and three to parse each of those divs.
import re
import json
import requests
from pprint import pprint
from bs4 import BeautifulSoup

def type_section(tag):
    """Find the tags that have the move type and pokemon name"""
    pattern = r"[A-Za-z]{3,} Type: [A-Za-z]{3,}"
    # if all these things are true, it should be the right tag
    return all((tag.name == 'div',
                len(tag.get('class', '')) == 1,
                'field__item' in tag.get('class', []),
                re.findall(pattern, tag.text),
                ))

def parse_type_pokemon(tag):
    """Parse out the move type and pokemon from the tag text"""
    s = tag.text.strip()
    poke_type, pokemon = s.split(' Type: ')
    return {'type': poke_type, 'pokemon': pokemon}

def parse_speciality(tag):
    """Parse the tag containing the speciality and moves"""
    table = tag.find('table')
    rows = table.find_all('tr')
    speciality_row, fast_row, charge_row = rows
    speciality_types = []
    for anchor in speciality_row.find_all('a'):
        # Each type 'badge' has an href with the type name at the end
        href = anchor.get('href')
        speciality_types.append(href.split('#')[-1])
    fast_move = fast_row.find('td').text
    charge_move = charge_row.find('td').text
    return {'speciality': speciality_types,
            'fast_move': fast_move,
            'charge_move': charge_move}

def parse_rating(tag):
    """Parse the tag containing categorical ratings and commentary"""
    table = tag.find('table')
    category_tags = table.find_all('th')
    strength_tag, meta_tag, future_tag = category_tags
    str_rating = strength_tag.parent.find('td').text.strip()
    meta_rating = meta_tag.parent.find('td').text.strip()
    future_rating = future_tag.parent.find('td').text.strip()  # was meta_tag: a copy/paste bug
    blurb_tags = table.find_all('td', {'colspan': '2'})
    if blurb_tags:
        # `if` to accommodate the fire section bug
        str_blurb_tag, meta_blurb_tag, future_blurb_tag = blurb_tags
        str_blurb = str_blurb_tag.text.strip()
        meta_blurb = meta_blurb_tag.text.strip()
        future_blurb = future_blurb_tag.text.strip()
    else:
        str_blurb = meta_blurb = future_blurb = None
    return {'strength': {
                'rating': str_rating,
                'commentary': str_blurb},
            'meta': {
                'rating': meta_rating,
                'commentary': meta_blurb},
            'future': {
                'rating': future_rating,
                'commentary': future_blurb}
            }

def extract_divs(tag):
    """
    Get the divs containing the moves/ratings,
    determined based on sibling position from the type tag
    """
    _, speciality_div, _, rating_div, *_ = tag.next_siblings
    return speciality_div, rating_div

def main():
    """All together now"""
    url = 'https://pokemongo.gamepress.gg/best-attackers-type'
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    types = {}
    for type_tag in soup.find_all(type_section):
        type_info = {}
        type_info.update(parse_type_pokemon(type_tag))
        speciality_div, rating_div = extract_divs(type_tag)
        type_info.update(parse_speciality(speciality_div))
        type_info.update(parse_rating(rating_div))
        type_ = type_info.get('type')
        types[type_] = type_info
    pprint(types)  # We did it

    with open('pokemon.json', 'w') as outfile:
        json.dump(types, outfile)

if __name__ == '__main__':
    main()
There is, for now, one small wrench in the whole thing. Remember when I said this pattern was almost entirely consistent? Well, the Fire type is an oddball here, because the page includes two pokemon for that type, so the Fire type results are not correct. I or some brave person may come up with a way to deal with that. Or maybe they'll decide on one fire pokemon in the future.
This code, the resulting json (prettified), and an archive of the HTML response used can be found in this gist.

error 400 - bad request when trying to set q search parameter in Soundcloud with Scrapy item

I'm trying to search SoundCloud for a track related to an artist's name. It works perfectly if I just type an artist name in the q search parameter; however, I want it to use the item['artist'] variable. I think there's probably a simple programming error which is causing the 'bad request' error. Here is the relevant code. Thanks guys!!
def parse_me(self, response):
    for info in response.xpath('//div[@class="entry vevent"]'):
        item = TutorialItem()  # Extract items from the items folder.
        item['artist'] = info.xpath('.//span[@class="summary"]//text()').extract()  # Extract artist information.
        #item['genre'] = info.xpath('.//div[@class="header"]//text()').extract()
        yield item  # Retrieve items in item

    client = soundcloud.Client(client_id='xxxx',
                               client_secret='xxxx',
                               callback='http://localhost:9000/#/callback.html')
    tracks = client.get('/tracks', q=item['artist'], limit=1)
    for track in tracks:
        print track.id
        item['trackz'] = track.id
        yield item
info.xpath('.//span[@class="summary"]//text()').extract() returns a list, but your q parameter probably requires a string.
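For example, a minimal sketch of the fix inside parse_me could use Scrapy's extract_first(), which returns a single string (or None when nothing matches):

# extract() returns a list like ['Artist Name']; the q parameter needs a plain string
artist = info.xpath('.//span[@class="summary"]//text()').extract_first() or ''
item['artist'] = artist.strip()

tracks = client.get('/tracks', q=item['artist'], limit=1)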
