Python-Requests Scraping YouTube description with BS4 issue

I'm trying to get both the text and the links as shown in the picture. But I can only get the text via siblings and the links after. I need them to come together like in the image. I tried using br.next_element but it doesn't grab the a-links. What am I missing?
import requests
from bs4 import BeautifulSoup
url_id = 'aM7aW0G58CI'
s = requests.Session()
r = s.get('https://www.youtube.com/watch?v='+url_id)
html = r.text
soup = BeautifulSoup(html, 'lxml')
for i in soup.find_all('p', id='eow-description'):
    for br in i.find_all('br'):
        next_sib = br.next_sibling
        print(next_sib)
for i in soup.find_all('p', id='eow-description'):
    for a in i.find_all('a'):
        print(a.text)
This is the output that I am getting. It does not match what the screenshot shows: the text lines come first and the links are printed separately afterwards.
Output:
Special shout to
Wanna support what we do? Livestream at 2PM PT!:
It Wasn’t Me, I Swear!:
TheDeFrancoFam Vlog:
————————————
CATCH UP ON THIS WEEK’S SHOWS:
<br/>
Why People Are Freaking Out About The Trump NFL Boycott and Anthony Weiner Going to Jail…:
WOW! Dirty Advertising Exposed And Major Backlash Following Unexpected Compromise…:
Why Trump's "HUGE Failure" Is A Massive Loss For His Enemies and A Shocking Change To Women's Rights:
DISGUSTING! The Horrible Truth About Belle Gibson Exposed, Controversial Video Blows Up, and More:
<br/>
————————————
GET SOME GEAR:
————————————
FACEBOOK:
TWITTER:
INSTAGRAM:
SNAPCHAT: TheDeFrancoFam
REDDIT:
ITUNES:
GOOGLE PLAY:
————————————
Edited by:
James Girardier -
Jason Mayer -
<br/>
Produced by:
Amanda Morones -
<br/>
Motion Graphics Artist:
Brian Borst -
<br/>
P.O. BOX
Attn: Philip DeFranco
16350 Ventura Blvd
Ste D #542
Encino, CA 91436
http://DKPhil.com
http://DeFrancoElite.com
https://youtu.be/fFxDbYE06zU
https://youtu.be/kR7DquGe4vY
https://youtu.be/qdWUQGHtyPk
https://youtu.be/CWlUs1-7KN4
https://youtu.be/kUWt-oipvOY
https://youtu.be/XVsTh4zxKNo
https://teespring.com/stores/defranco...
http://on.fb.me/mqpRW7
http://Twitter.com/PhillyD
https://instagram.com/phillydefranco/
https://www.reddit.com/r/DeFranco
http://DeFrancoMistakes.com
http://mistakeswithdefranco.com
https://twitter.com/jamesgirardier
https://www.instagram.com/jayjaymay/
https://twitter.com/MandaOhDang
https://twitter.com/brianjborst

By iterating over the children and checking each tag name (child.name), I came up with this:
import requests
from bs4 import BeautifulSoup
url_id = 'aM7aW0G58CI'
s = requests.Session()
r = s.get('https://www.youtube.com/watch?v='+url_id)
soup = BeautifulSoup(r.text, 'lxml')
# to concatenate <br>
br = ''
for p in soup.find_all('p', id='eow-description'):
    for child in p.children:
        if child.name == 'a':
            #print(' a:', child.text)
            print(br, child.text)
            br = '' # reset br
        elif child.name == 'br':
            if child.next_sibling.name != 'br': # skip <br/> ?
                #print('br:', child.next_sibling)
                br += str(child.next_sibling)
        #else:
        #    print(child.name, child)
I get:
Special shout to http://DKPhil.com
Wanna support what we do? Livestream at 2PM PT!: http://DeFrancoElite.com
It Wasn’t Me, I Swear!: https://youtu.be/fFxDbYE06zU
TheDeFrancoFam Vlog: https://youtu.be/kR7DquGe4vY
———————————— CATCH UP ON THIS WEEK’S SHOWS: Why People Are Freaking Out About The Trump NFL Boycott and Anthony Weiner Going to Jail…: https://youtu.be/qdWUQGHtyPk
WOW! Dirty Advertising Exposed And Major Backlash Following Unexpected Compromise…: https://youtu.be/CWlUs1-7KN4
Why Trump's "HUGE Failure" Is A Massive Loss For His Enemies and A Shocking Change To Women's Rights: https://youtu.be/kUWt-oipvOY
DISGUSTING! The Horrible Truth About Belle Gibson Exposed, Controversial Video Blows Up, and More: https://youtu.be/XVsTh4zxKNo
————————————GET SOME GEAR: https://teespring.com/stores/defranco...
————————————FACEBOOK: http://on.fb.me/mqpRW7
TWITTER: http://Twitter.com/PhillyD
INSTAGRAM: https://instagram.com/phillydefranco/
SNAPCHAT: TheDeFrancoFamREDDIT: https://www.reddit.com/r/DeFranco
ITUNES: http://DeFrancoMistakes.com
GOOGLE PLAY: http://mistakeswithdefranco.com
————————————Edited by:James Girardier - https://twitter.com/jamesgirardier
Jason Mayer - https://www.instagram.com/jayjaymay/
Produced by:Amanda Morones - https://twitter.com/MandaOhDang
Motion Graphics Artist:Brian Borst - https://twitter.com/brianjborst
EDIT: you may have to add this branch after the elif
else:
    print(child.name, child)
to get the P.O. box address.

I found a really simple way:
for p in soup.find_all('p', id='eow-description'):
print(p.get_text('\n'))
The only issue now is that some of the links are truncated with "..." in the visible text.
You can also play around with youtube-dl python module to get the description of a youtube video that way as well.
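The truncation comes from the anchor's visible text; the full target is normally stored in the href attribute. Below is a minimal sketch of that idea, run against a made-up HTML fragment shaped like the description paragraph in the question (SAMPLE_HTML and description_with_full_links are hypothetical names, and the example.com URLs are placeholders). On a real page the href may itself be a redirect URL, so treat this as a starting point rather than a finished solution.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, shaped like the description paragraph in the question.
SAMPLE_HTML = (
    '<p id="eow-description">GET SOME GEAR: '
    '<a href="https://example.com/stores/a-long-store-name">https://example.com/stores/a-lo...</a>'
    '<br/>TWITTER: <a href="http://example.com/user">http://example.com/user</a></p>'
)


def description_with_full_links(html):
    """Return the description text with each link's visible text replaced by its href."""
    soup = BeautifulSoup(html, "html.parser")
    p = soup.find("p", id="eow-description")
    if p is None:
        return ""
    for a in p.find_all("a"):
        # Fall back to the visible text when the anchor has no href.
        a.replace_with(a.get("href", a.get_text()))
    for br in p.find_all("br"):
        # Turn each <br> into a real line break so text and link stay on one line.
        br.replace_with("\n")
    return p.get_text()


print(description_with_full_links(SAMPLE_HTML))
```

Replacing the tags in place keeps each label and its link together on the same line, which is what the original question was after.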

I found another way, using the pafy module:
import pafy
url='https://www.youtube.com/watch?v=aM7aW0G58CI'
vid=pafy.new(url)
print(vid.description)
With this method, you get the description exactly as it is shown under the video on YouTube.

Related

Web-scraping: unable to extract the required text

I am trying to extract the novel description from this url https://www.wuxiaworld.co/Horizon-Bright-Moon-Sabre/
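However, when I try this code:
import requests
from bs4 import BeautifulSoup
site = "https://www.wuxiaworld.co/Horizon-Bright-Moon-Sabre/"
html = requests.get(site)
html.encoding = html.apparent_encoding
soup = BeautifulSoup(html.text, "html.parser")
summary = soup.find(id='intro').get_text()
print(summary)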
I get:
Description
Process finished with exit code 0
Any help would be appreciated, thanks in advance.

Try this:
site = "https://www.wuxiaworld.co/Horizon-Bright-Moon-Sabre/"
html = requests.get(site)
soup = BeautifulSoup(html.content)
summary = soup.find(id ='intro')
print(summary.text)
This prints out:
Description Fu Hongxue was a cripple, born with a lame leg and subject
to epileptic seizures. He was also one of the most powerful, legendary
figures of the martial arts world, with a dull black saber that was
second to none. His fame made him a frequent target of challengers,
but whenever his saber left its sheath, only corpses would remain in
its wake. One day, however, F...

Get Text from h1 with BeautifulSoup

I was asked to get a product name from a web page.
Specifically, I need to get this text:
SEIKO 5 AUTOMATIC MENS STEEL VINTAGE JAPAN MADE BLACK DIAL WATCH RUN ORDER K
This is my BeautifulSoup code:
import requests
from bs4 import BeautifulSoup
get = requests.get('https://www.ebay.com/itm/SEIKO-5-AUTOMATIC-MENS-STEEL-VINTAGE-JAPAN-MADE-BLACK-DIAL-WATCH-RUN-ORDER-K/143420840058?epid=18032713872&_trkparms=ispr%3D1&hash=item21648c587a:g:ZzEAAOSw9MRdsI8v&enc=AQAEAAACQBPxNw%2BVj6nta7CKEs3N0qVBgKB1sCHq6imZgPqwOxGc8125XNy2Dq0slMe8clDZgTSnJdS4K5F5NyTF%2FwJExAng2G2%2FdtRUNYEnKcxoo4WXaAM5K%2BUxqDKTnmNGfgjTzpWCdoE50XlC7BXz3bBrJTY0vo62kBVR03HYvJwVCxnu8NEBiz4YMfAlPWDNnP2lVje46p22rKWDem6rHFqpoKtLDVHS8CaQER%2BqJxucEnw14LJIybRkfCmDuobZv%2F4F9Lhrl8xiPp%2Bbk6iRIu3UqqocBO%2FNyxW1aAa8QWkaJqtUy3g6Yue61yMEb0GY3BwO1%2BpVwkTOZLDvYHXZ%2FZEGNu%2F%2BYznes9jNtctDCr9Xv3QECsXyLDEOeo7LHh1srunEoRvK9T0AkS7oT%2BI3%2B%2BtD5fGnpJJu%2FJ3MdktqvgnTwieipeZTrGsHiQ8iL1nWm0CJcMbe2UUELEG%2BLHPNSSkRcUVBWnoPuOE5FjuyFHR1ujG2TgGLfN8HlO6ZyfNWz0K%2Bc4zjo7wBPnJdffcn6p8kLHWhbFyMyIY1Jc8yZBl20mlA29S%2BN%2Bw0e3uZDHK%2BIyCBctbYgGxaQM6Aevcdx0OcXl%2Fy7aDoRTqhBue9OYrAa3fEQf6ObFqtCbiEiXTioQZZJfrC%2FXfbq36oMTuQAFRvH2ahowGoPhSQkE1Jn73QLI%2FGXVynHIG2KdQSbX4eU%2FgoGy9y5WIvvUL9Xxy4ltNvTtCpjg5XlY8VxDv4M2gsLY3C0SRv7LNELk%2FitBSjfuUjzg%3D%3D&checksum=143420840058aa89790ec2164a5caf16644bb1bfd7c8&enc=AQAEAAACQBPxNw%2BVj6nta7CKEs3N0qVBgKB1sCHq6imZgPqwOxGc8125XNy2Dq0slMe8clDZgTSnJdS4K5F5NyTF%2FwJExAng2G2%2FdtRUNYEnKcxoo4WXaAM5K%2BUxqDKTnmNGfgjTzpWCdoE50XlC7BXz3bBrJTY0vo62kBVR03HYvJwVCxnu8NEBiz4YMfAlPWDNnP2lVje46p22rKWDem6rHFqpoKtLDVHS8CaQER%2BqJxucEnw14LJIybRkfCmDuobZv%2F4F9Lhrl8xiPp%2Bbk6iRIu3UqqocBO%2FNyxW1aAa8QWkaJqtUy3g6Yue61yMEb0GY3BwO1%2BpVwkTOZLDvYHXZ%2FZEGNu%2F%2BYznes9jNtctDCr9Xv3QECsXyLDEOeo7LHh1srunEoRvK9T0AkS7oT%2BI3%2B%2BtD5fGnpJJu%2FJ3MdktqvgnTwieipeZTrGsHiQ8iL1nWm0CJcMbe2UUELEG%2BLHPNSSkRcUVBWnoPuOE5FjuyFHR1ujG2TgGLfN8HlO6ZyfNWz0K%2Bc4zjo7wBPnJdffcn6p8kLHWhbFyMyIY1Jc8yZBl20mlA29S%2BN%2Bw0e3uZDHK%2BIyCBctbYgGxaQM6Aevcdx0OcXl%2Fy7aDoRTqhBue9OYrAa3fEQf6ObFqtCbiEiXTioQZZJfrC%2FXfbq36oMTuQAFRvH2ahowGoPhSQkE1Jn73QLI%2FGXVynHIG2KdQSbX4eU%2FgoGy9y5WIvvUL9Xxy4ltNvTtCpjg5XlY8VxDv4M2gsLY3C0SRv7LNELk%2FitBSjfuUjzg%3D%3D&checksum=143420840058aa89790ec2164a5caf16644bb1bfd7c8'
)
soup = BeautifulSoup(get.text, 'lxml')
company = soup.select('h1.it-ttl')[0].text.strip()
print(company)
The HTML from the code is:
<h1 class="it-ttl" id="itemTitle" itemprop="name">
<span class="g-hdn">Details about
</span>
SEIKO 5 AUTOMATIC MENS STEEL VINTAGE JAPAN MADE BLACK DIAL WATCH RUN ORDER K
</h1>
Instead of the desired text, I get this:
Details about SEIKO 5 AUTOMATIC MENS STEEL VINTAGE JAPAN MADE BLACK DIAL WATCH RUN ORDER K
How can I extract only the product name?

import requests
from bs4 import BeautifulSoup
get = requests.get('https://www.ebay.com/itm/SEIKO-5-AUTOMATIC-MENS-STEEL-VINTAGE-JAPAN-MADE-BLACK-DIAL-WATCH-RUN-ORDER-K/143420840058?epid=18032713872&_trkparms=ispr%3D1&hash=item21648c587a:g:ZzEAAOSw9MRdsI8v&enc=AQAEAAACQBPxNw%2BVj6nta7CKEs3N0qVBgKB1sCHq6imZgPqwOxGc8125XNy2Dq0slMe8clDZgTSnJdS4K5F5NyTF%2FwJExAng2G2%2FdtRUNYEnKcxoo4WXaAM5K%2BUxqDKTnmNGfgjTzpWCdoE50XlC7BXz3bBrJTY0vo62kBVR03HYvJwVCxnu8NEBiz4YMfAlPWDNnP2lVje46p22rKWDem6rHFqpoKtLDVHS8CaQER%2BqJxucEnw14LJIybRkfCmDuobZv%2F4F9Lhrl8xiPp%2Bbk6iRIu3UqqocBO%2FNyxW1aAa8QWkaJqtUy3g6Yue61yMEb0GY3BwO1%2BpVwkTOZLDvYHXZ%2FZEGNu%2F%2BYznes9jNtctDCr9Xv3QECsXyLDEOeo7LHh1srunEoRvK9T0AkS7oT%2BI3%2B%2BtD5fGnpJJu%2FJ3MdktqvgnTwieipeZTrGsHiQ8iL1nWm0CJcMbe2UUELEG%2BLHPNSSkRcUVBWnoPuOE5FjuyFHR1ujG2TgGLfN8HlO6ZyfNWz0K%2Bc4zjo7wBPnJdffcn6p8kLHWhbFyMyIY1Jc8yZBl20mlA29S%2BN%2Bw0e3uZDHK%2BIyCBctbYgGxaQM6Aevcdx0OcXl%2Fy7aDoRTqhBue9OYrAa3fEQf6ObFqtCbiEiXTioQZZJfrC%2FXfbq36oMTuQAFRvH2ahowGoPhSQkE1Jn73QLI%2FGXVynHIG2KdQSbX4eU%2FgoGy9y5WIvvUL9Xxy4ltNvTtCpjg5XlY8VxDv4M2gsLY3C0SRv7LNELk%2FitBSjfuUjzg%3D%3D&checksum=143420840058aa89790ec2164a5caf16644bb1bfd7c8&enc=AQAEAAACQBPxNw%2BVj6nta7CKEs3N0qVBgKB1sCHq6imZgPqwOxGc8125XNy2Dq0slMe8clDZgTSnJdS4K5F5NyTF%2FwJExAng2G2%2FdtRUNYEnKcxoo4WXaAM5K%2BUxqDKTnmNGfgjTzpWCdoE50XlC7BXz3bBrJTY0vo62kBVR03HYvJwVCxnu8NEBiz4YMfAlPWDNnP2lVje46p22rKWDem6rHFqpoKtLDVHS8CaQER%2BqJxucEnw14LJIybRkfCmDuobZv%2F4F9Lhrl8xiPp%2Bbk6iRIu3UqqocBO%2FNyxW1aAa8QWkaJqtUy3g6Yue61yMEb0GY3BwO1%2BpVwkTOZLDvYHXZ%2FZEGNu%2F%2BYznes9jNtctDCr9Xv3QECsXyLDEOeo7LHh1srunEoRvK9T0AkS7oT%2BI3%2B%2BtD5fGnpJJu%2FJ3MdktqvgnTwieipeZTrGsHiQ8iL1nWm0CJcMbe2UUELEG%2BLHPNSSkRcUVBWnoPuOE5FjuyFHR1ujG2TgGLfN8HlO6ZyfNWz0K%2Bc4zjo7wBPnJdffcn6p8kLHWhbFyMyIY1Jc8yZBl20mlA29S%2BN%2Bw0e3uZDHK%2BIyCBctbYgGxaQM6Aevcdx0OcXl%2Fy7aDoRTqhBue9OYrAa3fEQf6ObFqtCbiEiXTioQZZJfrC%2FXfbq36oMTuQAFRvH2ahowGoPhSQkE1Jn73QLI%2FGXVynHIG2KdQSbX4eU%2FgoGy9y5WIvvUL9Xxy4ltNvTtCpjg5XlY8VxDv4M2gsLY3C0SRv7LNELk%2FitBSjfuUjzg%3D%3D&checksum=143420840058aa89790ec2164a5caf16644bb1bfd7c8'
)
soup = BeautifulSoup(get.text, 'html.parser')
company = soup.select('h1.it-ttl')[0].text.strip()
span_text = soup.select('span.g-hdn')[0].text.strip()
print(company)
print(span_text)
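# Slice the span text off the front; lstrip() would treat it as a set of characters.
print(company[len(span_text):].strip())
Since the span tag is nested in the h1 tag, the h1 text begins with the span text. The fix is to extract the span text and slice it off the front of the h1 text. Avoid lstrip for this: it strips any leading characters that appear in its argument rather than removing the argument as a prefix, so it can also eat the first letters of the product name.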
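An alternative that avoids string surgery altogether is to read only the text nodes that are direct children of the h1. The sketch below runs against the markup quoted in the question; title_without_span is a hypothetical helper name, and it assumes the product name is always a direct text child of the h1.

```python
from bs4 import BeautifulSoup

# Markup copied from the question, trimmed to the <h1> element.
SAMPLE_HTML = """
<h1 class="it-ttl" id="itemTitle" itemprop="name">
<span class="g-hdn">Details about
</span>
SEIKO 5 AUTOMATIC MENS STEEL VINTAGE JAPAN MADE BLACK DIAL WATCH RUN ORDER K
</h1>
"""


def title_without_span(html):
    """Return the h1 text, ignoring anything nested in child tags such as the span."""
    soup = BeautifulSoup(html, "html.parser")
    h1 = soup.select_one("h1.it-ttl")
    if h1 is None:
        return None
    # recursive=False limits the search to direct children of <h1>,
    # so the text inside <span class="g-hdn"> is skipped.
    direct_text = h1.find_all(string=True, recursive=False)
    return "".join(direct_text).strip()


print(title_without_span(SAMPLE_HTML))
```

This does not depend on what the hidden span says, so it keeps working if the "Details about" label changes.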

Return empty bracket [ ] when web scraping

I am trying to print all the titles on nytimes.com using the requests and BeautifulSoup modules, but the result is an empty list: [ ]. How can I fix this problem?
import requests
from bs4 import BeautifulSoup
url = "https://www.nytimes.com/"
r = requests.get(url)
text = r.text
soup = BeautifulSoup(text, "html.parser")
title = soup.find_all("span", "balanceHeadline")
print(title)

I am assuming that you are trying to retrieve the headlines from nytimes.com. Doing title = soup.find_all("span", {'class':'balancedHeadline'}) will not get you the results. The <span> tag found using the element selector is often misleading. What you have to do is look at the source code of the page and find the tags wrapped around the title.
For nytimes it's a little tricky because the headlines are wrapped in a <script> tag with a lot of junk inside. What you can do is "clean" the string first and then deserialize it by converting it into a Python dictionary.
import requests
from bs4 import BeautifulSoup
import json
url = "https://www.nytimes.com/"
r = requests.get(url)
r_html = r.text
soup = BeautifulSoup(r_html, "html.parser")
scripts = soup.find_all('script')
for script in scripts:
    if 'preloadedData' in script.text:
        jsonStr = script.text
        jsonStr = jsonStr.split('=', 1)[1].strip() # remove "window.__preloadedData = "
        jsonStr = jsonStr.rsplit(';', 1)[0] # remove trailing ;
        jsonStr = json.loads(jsonStr)
        for key,value in jsonStr['initialState'].items():
            try:
                if value['promotionalHeadline'] != "":
                    print(value['promotionalHeadline'])
            except (KeyError, TypeError): # entry has no headline or is not a dict
                continue
Output:
Jeffrey Epstein Autopsy Results Conclude He Hanged Himself
Trump and Netanyahu Put Bipartisan Support for Israel at Risk
Congresswoman Rejects Israel’s Offer of a West Bank Visit
In Tlaib’s Ancestral Village, a Grandmother Weathers a Global Political Storm
Cathay Chief’s Resignation Shows China’s Power Over Hong Kong Unrest
Trump Administration Approves Fighter Jet Sales to Taiwan
Peace Road Map for Afghanistan Will Let Taliban Negotiate Women’s Rights
Debate Flares Over Afghanistan as Trump Considers Troop Withdrawal
In El Paso, Hundreds Show Up to Mourn a Woman They Didn’t Know
Is Slavery’s Legacy in the Power Dynamics of Sports?
Listen: ‘Modern Love’ Podcast
‘The Interpreter’
If You Think Trump Is Helping Israel, You’re a Fool
First They Came for the Black Feminists
How Women Can Escape the Likability Trap
With Trump as President, the World Is Spiraling Into Chaos
To Understand Hong Kong, Don’t Think About Tiananmen
The Abrupt End of My Big-Girl Summer
From Trump Boom to Trump Gloom
What Are Trump and Netanyahu Afraid Of?
King Bibi Bows Before a Tweet
Ebola Could Be Eradicated — But Only if the World Works Together
The Online Mob Came for Me. What Happened to the Reckoning?
A German TV Star Takes On Bullies
Why Is Hollywood So Scared of Climate Change?
Solving Medical Mysteries With Your Help: Now on Netflix
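The clean-then-deserialize step can be tried offline. This is a minimal sketch that applies the same split, rsplit and json.loads sequence to a made-up script body; SAMPLE_SCRIPT, its keys a1 to a4, and extract_headlines are hypothetical, and only the window.__preloadedData prefix, initialState and promotionalHeadline names come from the answer above. If the real payload contains anything that is not strict JSON, json.loads will raise an error.

```python
import json

# Hypothetical stand-in for the contents of the <script> tag.
SAMPLE_SCRIPT = (
    'window.__preloadedData = {"initialState": {'
    '"a1": {"promotionalHeadline": "First headline"}, '
    '"a2": {"promotionalHeadline": ""}, '
    '"a3": {"kicker": "no headline here"}, '
    '"a4": "a plain string value"}};'
)


def extract_headlines(script_text):
    """Strip the JavaScript assignment around the JSON and collect non-empty headlines."""
    # Drop everything up to the first "=", then everything from the last ";".
    payload = script_text.split("=", 1)[1].strip()
    payload = payload.rsplit(";", 1)[0]
    data = json.loads(payload)
    headlines = []
    for value in data["initialState"].values():
        # Values are not guaranteed to be dicts, so check before calling .get().
        if isinstance(value, dict) and value.get("promotionalHeadline"):
            headlines.append(value["promotionalHeadline"])
    return headlines


print(extract_headlines(SAMPLE_SCRIPT))
```

Checking the type explicitly replaces the try/except and makes it clearer which entries are being skipped.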

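Replace
title = soup.find_all("span", "balanceHeadline")
with
title = soup.find_all("span", {'class':'balanceHeadline'})
Note that Beautiful Soup already treats a plain string in the second argument as a CSS class filter, so this only makes the intent explicit. If the list is still empty, the spans are not in the HTML that requests receives.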
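A quick way to check how the different spellings of a class filter behave is to run them against a small fragment. This sketch uses a made-up SAMPLE_HTML string; only the balanceHeadline class name comes from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with one matching and one non-matching span.
SAMPLE_HTML = '<span class="balanceHeadline">One</span><span class="other">Two</span>'
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Three spellings of the same class filter.
by_string = soup.find_all("span", "balanceHeadline")
by_dict = soup.find_all("span", {"class": "balanceHeadline"})
by_keyword = soup.find_all("span", class_="balanceHeadline")

print([tag.get_text() for tag in by_string])
print([tag.get_text() for tag in by_dict])
print([tag.get_text() for tag in by_keyword])
```

All three calls select the same element here, which suggests the empty result in the question comes from the page content rather than from the call syntax.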

Difficulty using beautifulsoup in Python to scrape web data from multiple HTML classes

I am using Beautiful Soup in Python to scrape some data from a property listings site.
I have had success in scraping the individual elements that I require but wish to use a more efficient script to pull back all the data in one command if possible.
The difficulty is that the various elements I require reside in different classes.
I have tried the following, so far.
for listing in content.findAll('h2', attrs={"class": "listing-results-attr"}):
    print(listing.text)
which successfully gives the following list
15 room mansion for sale
3 bed barn conversion for sale
2 room duplex for sale
1 bed garden shed for sale
Separately, to retrieve the address details for each listing I have used the following successfully;
for address in content.findAll('a', attrs={"class": "listing-results-address"}):
    print(address.text)
which gives this
22 Acacia Avenue, CityName Postcode
100 Sleepy Hollow, CityName Postcode
742 Evergreen Terrace, CityName Postcode
31 Spooner Street, CityName Postcode
And for property price I have used this...
for prop_price in content.findAll('a', attrs={"class": "listing-results-price"}):
    print(prop_price.text)
which gives...
$350,000
$1,250,000
$750,000
$100,000
This is great; however, I need to pull back all of this information more efficiently, so that all the data comes back in one pass.
At present I can do this using something like the code below:
all = content.select("h2.listing-results-attr, a.listing-results-address, a.listing-results-price")
This works somewhat, but it brings back too many additional HTML tags and is not nearly as elegant as I need. The results are as follows.
</a>, <h2 class="listing-results-attr">
15 room mansion for sale
</h2>, <a class="listing-results-address" href="redacted">22 Acacia Avenue, CityName Postcode</a>, <a class="listing-results-price" href="redacted">
$350,000
Expected results should look something like this:
15 room mansion for sale
22 Acacia Avenue, CityName Postcode
$350,000
3 bed barn conversion for sale
100 Sleepy Hollow, CityName Postcode
$1,250,000
etc
etc
I then need to be able to store the results as JSON objects for later analysis.
Thanks in advance.

Change your selectors as shown below:
import requests
from bs4 import BeautifulSoup as bs
url = 'https://www.zoopla.co.uk/for-sale/property/caerphilly/?q=Caerphilly&results_sort=newest_listings&search_source=home'
r = requests.get(url)
soup = bs(r.content, 'lxml')
details = ([item.text.strip() for item in soup.select(".listing-results-attr a, .listing-results-address , .text-price")])
You can view separately with, for example,
prices = details[0::3]
descriptions = details[1::3]
addresses = details[2::3]
print(prices, descriptions, addresses)

find_all() always returns a list, and strip() removes whitespace from the beginning and end of a string. Looping over each listing keeps the three fields together:
import requests
from bs4 import BeautifulSoup as bs
url = 'https://www.zoopla.co.uk/for-sale/property/caerphilly/?q=Caerphilly&results_sort=newest_listings&search_source=home'
r = requests.get(url)
soup = bs(r.content, 'lxml')
results = soup.find("ul",{'class':"listing-results clearfix js-gtm-list"})
for li in results.find_all("li",{'class':"srp clearfix"}):
    price = li.find("a",{"class":"listing-results-price text-price"}).text.strip()
    address = li.find("a",{'class':"listing-results-address"}).text.strip()
    description = li.find("h2",{'class':"listing-results-attr"}).find('a').text.strip()
    print(description)
    print(address)
    print(price)
Output:
2 bed detached bungalow for sale
Bronrhiw Fach, Caerphilly CF83
£159,950
2 bed semi-detached house for sale
Cwrt Nant Y Felin, Caerphilly CF83
£159,950
3 bed semi-detached house for sale
Pen-Y-Bryn, Caerphilly CF83
£102,950
.....
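The question also asks for the results as JSON. A minimal sketch of one way to do that is to build one dictionary per listing and serialize the list with json.dumps. The markup in SAMPLE_HTML is made up and only reuses the class names from the question; text_or_none and listings_to_records are hypothetical helper names, and the li.srp container is an assumption taken from the answer above.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical markup using the class names from the question.
SAMPLE_HTML = """
<ul>
  <li class="srp">
    <h2 class="listing-results-attr"><a href="#">15 room mansion for sale</a></h2>
    <a class="listing-results-address" href="#">22 Acacia Avenue, CityName Postcode</a>
    <a class="listing-results-price" href="#"> $350,000 </a>
  </li>
  <li class="srp">
    <h2 class="listing-results-attr"><a href="#">3 bed barn conversion for sale</a></h2>
    <a class="listing-results-address" href="#">100 Sleepy Hollow, CityName Postcode</a>
    <a class="listing-results-price" href="#"> $1,250,000 </a>
  </li>
</ul>
"""


def text_or_none(parent, selector):
    """Return the stripped text of the first match under parent, or None if absent."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node else None


def listings_to_records(html):
    """Build one dict per listing so the three fields stay paired."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for li in soup.select("li.srp"):
        records.append({
            "description": text_or_none(li, ".listing-results-attr"),
            "address": text_or_none(li, ".listing-results-address"),
            "price": text_or_none(li, ".listing-results-price"),
        })
    return records


records = listings_to_records(SAMPLE_HTML)
print(json.dumps(records, indent=2))
```

Working per listing rather than slicing one flat list means a listing with a missing field yields None for that field instead of shifting every later value.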

Missing value while parsing webpage with BeautifulSoup 4

As a practice assignment I am trying to parse this search results page from Amazon using BeautifulSoup library.
Here's my code.
from urllib import urlopen
from bs4 import BeautifulSoup
SourceURL = "http://www.amazon.in/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=android"
ResultsPage = urlopen(SourceURL )
Soup = BeautifulSoup(ResultsPage)
print "<SearchResults>"
for SearchResult in Soup.findAll('li', attrs={'class': 's-result-item celwidget'}):
    #Read Result Title
    Title = SearchResult.find("h2", {"class": "a-size-medium a-color-null s-inline s-access-title a-text-normal"})
    ResultTag = "\t<Result><![CDATA["
    if Title is not None:
        ResultTag += Title.text
    ResultTag += "]]></Result>"
    print ResultTag
print "</SearchResults>"
The output displayed is as below
<SearchResults>
<Result><![CDATA[Micromax Bolt S301 (Black, No charger, No earphone inbox)]]></Result>
<Result><![CDATA[Android Application Development (with Kitkat Support), Black Book]]></Result>
<Result><![CDATA[ZTE Blade Buzz White V815W]]></Result>
<Result><![CDATA[Android: App Development & Programming Guide: Learn In A Day! (Android, Rails, Ruby Programming, App Development...]]></Result>
<Result><![CDATA[]]></Result>
<Result><![CDATA[Karbonn Titanium S21 (Grey)]]></Result>
<Result><![CDATA[Head First Android Development]]></Result>
<Result><![CDATA[Micromax Canvas A1 Android One (White, 8GB)]]></Result>
<Result><![CDATA[Professional Android 4 Application Development (Wrox)]]></Result>
<Result><![CDATA[OnePlus X (Onyx) - Invite Only]]></Result>
<Result><![CDATA[Lenovo Vibe S1 (4G, White)]]></Result>
<Result><![CDATA[Micromax Bolt D320 (Black, 4GB)]]></Result>
<Result><![CDATA[2 in 1 Capacitive Stylus Pen With Black Ball Pen for Android Touch Sceen Mobile Phones and Tablets All iPads and...]]></Result>
<Result><![CDATA[Moto E 2nd Generation XT1506 (3G, Black)]]></Result>
<Result><![CDATA[Android: App Development & Programming Guide: Learn In A Day!]]></Result>
<Result><![CDATA[Lenovo Vibe S1 (4G, Dark Blue)]]></Result>
</SearchResults>
Notice that the fifth result is empty in the output, while all the other rows print correctly with the same code. Essentially, SearchResult.find() returns None for that one record only.
Can you please let me know if I am missing something?
Thanks,
Nikhil

If you look at the page at http://www.amazon.in/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=android, the fifth li element matches your class criterion s-result-item celwidget, but it is actually the "Customers shopped for android in" block. It has nothing that fully matches your second criterion, a-size-medium a-color-null s-inline s-access-title a-text-normal, so Title is set to None.
You can move the output inside the condition so that only results with a title are printed:
if Title is not None:
    ResultTag = "\t<Result><![CDATA["
    ResultTag += Title.text
    ResultTag += "]]></Result>"
    print ResultTag
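For readers on Python 3, the same skip-when-None idea can be sketched against a small made-up fragment. SAMPLE_HTML and result_titles are hypothetical; the class names are shortened from the ones in the question, and matching on a single class with class_ is a design choice made here, since an exact match on the full class string breaks if the classes appear in a different order.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second <li> has the result class but no title <h2>.
SAMPLE_HTML = """
<ul>
  <li class="s-result-item celwidget"><h2 class="s-access-title">Head First Android Development</h2></li>
  <li class="s-result-item celwidget"><div>Customers shopped for android in</div></li>
  <li class="s-result-item celwidget"><h2 class="s-access-title">Karbonn Titanium S21 (Grey)</h2></li>
</ul>
"""


def result_titles(html):
    """Collect titles, skipping list items that carry no title heading."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for item in soup.find_all("li", class_="s-result-item"):
        title = item.find("h2", class_="s-access-title")
        if title is None:
            continue  # not a product row, so emit nothing for it
        titles.append(title.get_text(strip=True))
    return titles


print("<SearchResults>")
for title in result_titles(SAMPLE_HTML):
    print("\t<Result><![CDATA[{}]]></Result>".format(title))
print("</SearchResults>")
```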
